Claude Opus 4.7: Is it the strongest model?
Opus 4.7 was never meant to be the "best model" at all. It is a release with a clear trade-off, a "precision knife" style of launch.

Original title: "Opus 4.7 doesn't want to be the strongest model: you can't keep up with Anthropic"
Original source: Silicon Starman Pro
On April 16, 2026, Anthropic officially released Claude Opus 4.7, just over two months after the previous-generation Opus 4.6.
After a recent run of intensive, almost frantic product and model updates, a new Anthropic model naturally arrives wrapped in hype. You have probably already seen plenty of first-look write-ups calling Opus 4.7 the "most powerful model," complete with "programmers are finished" and "unemployment warning" headlines.
But look at what Anthropic itself actually published.
The tone of this release is not normal at all.
Anthropic states directly in the announcement that Opus 4.7 is less capable than Claude Mythos Preview, and Mythos is open only to a few partners such as Apple, Google, Microsoft, and Nvidia; it is not available to ordinary developers and users.
More interesting than the rhetoric is that Opus 4.7 is not only weaker than the near-mythical Mythos; it is also weaker than the previous generation's model in some of its key capabilities.
Opus 4.7 posts an unusual number on its own benchmark sheet: MRCR v2@1M falls from Opus 4.6's 78.3% to 32.2%, a sharp decline of 46 percentage points.
Very few flagship models dare to cut one of their signature capabilities by more than half.
And that was a deliberate choice.
So if you keep running on autopilot and hyping every new model as the strongest, you are no longer keeping up with Anthropic's own rhythm.

It does not even bother to dress up this crash.
Opus 4.7 is a release with no intention of being the "most powerful model": a clear trade-off, a "precision knife"-style launch that departs from the playbook of previous frontier-model vendors. It points to a new direction that today's leading vendors will collectively move toward once they clearly feel that "big leaps" in the model itself are no longer sustainable. To some extent, Anthropic is already aligning with the marketing strategies Apple, Microsoft, and others adopted once their products were fully mature.
That is probably where 4.7 really matters.
I. Programming capability: real improvements behind the numbers
The best way to understand these changes is to take a closer look at what the model actually does this time.
Here is the complete picture of this Opus 4.7 release: where it improved, where it regressed, what first-hand developer feedback says, and whether you should migrate.
Official announcement: https://www.anthropic.com/news/claude-opus-4-7
Programming gains were the main axis of this release.

SWE-bench Verified (500 real GitHub issues; the model must write patches that pass the tests): up from Opus 4.6's 80.8% to 87.6%, a gain of nearly 7 percentage points and the top score among currently public models. Against Gemini 3.1 Pro's 80.6%, the gap is significant.
SWE-bench Pro (a harder variant covering complete engineering pipelines in four programming languages): Opus 4.7 jumped from 53.4% to 64.3%, up 11 percentage points. Against GPT-5.4's 57.7% and Gemini 3.1 Pro's 54.2%, Opus 4.7 clearly leads on this benchmark.
Cursor Bench (a field benchmark from Cursor that measures a model's coding-assistance quality in a real IDE environment): Opus 4.6 scored 58%; Opus 4.7 jumped to 70%, up 12 percentage points. Cursor co-founder Michael Truell said in the official announcement: "This is a meaningful leap in capability, with more creative reasoning on hard problems."
Partner evaluations:
• Rakuten: Opus 4.7 solved three times as many production tasks as Opus 4.6, with double-digit gains in code quality and test quality
• Fact: task success rates increased 10-15 percent, and the number of model stalls dropped significantly
• Cognition (the company behind Devin): the model "can work for hours without losing the thread"
• CodeRabbit: recall rate up more than 10%, "slightly faster than GPT-5.4 xhigh"
• Bolt: on a longer application-builder task, Opus 4.7 clearly outperformed 4.6
• Terminal-Bench 2.0: Opus 4.7 solved three tasks that no previous Claude model (or competitor) had managed, one of which required multi-file reasoning across a complex codebase to fix a race condition

These data points all aim in one direction: Opus 4.7 has clearly improved on long-horizon, multi-file programming tasks that demand context consistency. That is exactly where users have complained most over the past two months: jobs abandoned halfway, and models losing the thread the moment multiple files are involved.
II. Vision: the most undervalued improvement of the launch
The visual accuracy benchmark XBOW jumped from 54.5% to 98.5%. That is not an incremental improvement; it is a rebuild-level leap.
Specific spec changes:
• Maximum image resolution increased from about 1.15 million pixels (long edge 1,568 px) to about 3.75 million pixels (long edge 2,576 px), more than triple the previous generation
• Model coordinates and actual pixels now map 1:1; previously a task required manually converting by a scaling factor, and that step disappears (see the sketch after this list)
• CharXiv visual reasoning benchmark: 82.1% without tools, 91.0% with tools
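The 1:1 claim is easiest to appreciate next to the conversion it replaces. Below is a minimal Python sketch of that old bookkeeping; the long-edge caps are the article's figures, while the helper names and example coordinates are illustrative, not any official API.

```python
# Sketch: mapping model-reported coordinates back to original-image pixels.
# Long-edge caps (1,568 px for 4.6, 2,576 px for 4.7) come from the article.

def scale_factor(width: int, height: int, long_edge_cap: int) -> float:
    """Factor the image was shrunk by before the model saw it."""
    return min(1.0, long_edge_cap / max(width, height))

def to_original_px(x: float, y: float, width: int, height: int,
                   long_edge_cap: int) -> tuple[float, float]:
    """Convert model coordinates back into original-image pixels."""
    s = scale_factor(width, height, long_edge_cap)
    return x / s, y / s

# Opus 4.6 era: a 4000x3000 screenshot is shrunk to a 1,568 px long edge,
# so a click target the model reports at (500, 400) must be rescaled
# before you can actually click it:
print(to_original_px(500, 400, 4000, 3000, long_edge_cap=1568))  # ~(1276, 1020)

# Opus 4.7, per the announcement: the cap rises to 2,576 px and coordinates
# map 1:1, so for any image under the cap the scale factor is 1.0 and the
# conversion step becomes a no-op:
print(to_original_px(500, 400, 2048, 1536, long_edge_cap=2576))  # (500.0, 400.0)
```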

Which scenarios does this actually affect?
For product teams, this upgrade could be decisive. Computer use in the Opus 4.6 era was "capable but too risky to ship": the error rate was too high and too unpredictable. A visual accuracy of 98.5% means the feature crosses the threshold for reliable deployment for the first time. Several technical bloggers wrote in their evaluations: "If you shelved a computer-use product plan because of Opus 4.6's high error rate, 4.7 removes that barrier."
First-hand feedback on Reddit (r/ClaudeAI): one user wrote, "The visual improvement is critical. I've done a lot of fiddly side projects trying to get models to improve their output in a visual feedback loop, and it's been frustrating. I'm really curious how 4.7 handles it."
Beyond computer use, the beneficiaries include scanned-document analysis (reading smaller fonts, picking out finer chart details), zoom-in understanding, dashboard-style applications, and complex PDF processing.
A cost issue to watch: higher-resolution images consume more tokens. If your application does not need fine detail, downsample images before uploading.
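If that cost matters for your workload, the downsampling step is a one-liner. A minimal sketch, assuming Pillow is available; the 1,280 px target is an arbitrary example, not a figure from Anthropic:

```python
# Shrink an image before upload so a coarse task doesn't pay for
# full-resolution tokens.
from PIL import Image

def downsample_for_upload(path: str, out_path: str, long_edge: int = 1280) -> None:
    img = Image.open(path)
    w, h = img.size
    s = long_edge / max(w, h)
    if s < 1.0:  # only shrink, never enlarge
        img = img.resize((round(w * s), round(h * s)), Image.LANCZOS)
    img.save(out_path)

downsample_for_upload("dashboard.png", "dashboard_small.png")
```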

III. The biggest regression: long context has collapsed
MRCR v2@1M (million-token context memory test):
• Opus 4.6: 78.3%
• Opus 4.7: 32.2%
A collapse of 46 percentage points, from nearly 80% to roughly a third.
This drop has almost no precedent in flagship-model history. MRCR v2 was a capability Anthropic itself highlighted in the Opus 4.6 era, calling it "a qualitative change in the context scale at which a model actually works." With 4.7, that "qualitative change" simply disappeared.
Why? The tokenizer changed.
Opus 4.7 ships with a new tokenizer: the same input text now produces roughly 1.0-1.35x as many tokens, with the multiplier varying by content type.
The direct chain reaction (worked numbers follow the list):
• The 200K/1M context windows are nominally unchanged, but the same text fills more of them
• Actual token consumption rises by roughly 35% on long tasks and agent workflows
• Pricing is unchanged ($5 input, $25 output per million tokens), but actual usage costs go up
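A back-of-envelope calculation makes the squeeze concrete. The window sizes, prices, and the 1.0-1.35x multiplier are the article's numbers; the rest is arithmetic.

```python
# How much 4.6-era text still fits in the nominal windows, and what a
# long task's input bill looks like, under the new tokenizer.
INPUT_PRICE = 5 / 1_000_000  # $ per input token, per the article

for window in (200_000, 1_000_000):
    for mult in (1.0, 1.35):
        effective = window / mult  # old-tokenizer-equivalent capacity
        print(f"{window:>9,}-token window at {mult}x -> "
              f"~{effective:,.0f} tokens of 4.6-era text")

# A task that consumed 1M input tokens on Opus 4.6 now needs ~1.35M:
print(f"input cost: ${1_000_000 * INPUT_PRICE:.2f} -> "
      f"${1_350_000 * INPUT_PRICE:.2f} (+35%)")
```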
Anthropic's official line is that the new tokenizer "improves text-processing efficiency," but the benchmark data show a marked regression on long context.
Search capability is also down:
• BrowseComp (deep web research): down from Opus 4.6's 83.7% to 79.3%
• GPT-5.4 Pro scores 89.3% and Gemini 3.1 Pro 85.9%; Opus 4.7 currently sits at the bottom of the main competitive set
Search and long documents are the most common scenarios for many business users.
First-hand developer feedback on Hacker News (275 points, 215 comments, source: HN discussion):
"to turn off the offensive thinking and pull effort manually to the top to get me back to the baseline. "Our internal assessment looks good" is not enough, and everyone sees the same problem. ""4.7 Default no longer contains human readable reasoning token digests in output, which must be returned by requesting Riga display: returned."
These are issues reported by actual users. But they are also choices Anthropic made deliberately.
IV. New behaviour: self-verification and more literal instruction-following
The official Opus 4.7 announcement contains one statement worth dwelling on: the model verifies its output before reporting results.
Hex's technical team gave a concrete case from testing: when data was missing, Opus 4.7 reported that the data did not exist rather than producing an answer that looked plausible but was fabricated; the latter is exactly the pit Opus 4.6 fell into. Fintech platform Block added: "It can catch its own logical errors at the planning stage, which speeds up execution and outpaces older Claude models."
But self-verification brings an associated behavioural change: Opus 4.7 interprets instructions more literally.
This is a real migration risk. Prompts carefully tuned for Opus 4.6 will probably not be "read between the lines" the way 4.6 did; 4.7 will do strictly what you wrote. Anthropic explicitly calls this out in the official migration guide and recommends regression-testing key prompts before going live on 4.7; a minimal harness of that shape is sketched below.
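As a sketch only: replay a fixed prompt set against both model IDs and count passes. The call shape follows the public Anthropic Python SDK; the 4.6-era model ID and the check_output assertion are stand-ins you would replace with your own.

```python
# Minimal prompt-regression harness of the kind the migration guide suggests.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
PROMPTS = ["Summarize this changelog: ...", "Refactor the attached function: ..."]

def check_output(text: str) -> bool:
    """Stand-in for your task-specific assertions."""
    return len(text) > 0

# "claude-opus-4-6" is an assumed old ID; "claude-opus-4-7" is from the article.
for model in ("claude-opus-4-6", "claude-opus-4-7"):
    passed = 0
    for prompt in PROMPTS:
        msg = client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        passed += check_output(msg.content[0].text)
    print(f"{model}: {passed}/{len(PROMPTS)} prompts passed")
```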
A useful reference figure from Hex's CTO: Opus 4.7 at low effort performs roughly on par with Opus 4.6 at medium effort.
V. Reasoning control: xhigh, task budgets, and /ultrareview
Opus 4.6 had an incident that damaged user trust: on February 9 the default model was switched to adaptive thinking, and on March 3 Claude Code's default reasoning depth was officially lowered from the top tier to medium on the grounds of "balancing intelligence, latency, and cost." Users dubbed the episode "the Deceptive Gate," and a pointed question from a senior GitHub director was widely circulated.
Opus 4.7 responds by putting control over reasoning depth more visibly in users' hands.
xhigh effort: a new reasoning-strength tier between the original high and max. Claude Code has now moved all of its planning defaults to xhigh.
But the developer community questions xhigh directly. In one Reddit user's words: "Opus 4.6 defaulted to medium, and 4.7 defaults to xhigh. I'd like to know what's behind this decision, because obviously a higher effort tier means more token consumption."
In other words, users got a "return control to the user" fix, but the default tier was quietly raised, meaning the same task is now set to burn more tokens. Add the tokenizer change and it is a double cost increase.
Task budgets: a token-budget control mechanism for long tasks. The developer sets a total token budget (minimum 20K); the model can see the remaining balance in real time during execution and allocate resources accordingly, avoiding both mid-task stops from token overruns and unnecessary compute waste. A rough client-side model of the bookkeeping follows.
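The announcement as relayed here describes the mechanism but not the API surface, so treat this as a hypothetical illustration of the accounting, not Anthropic's interface.

```python
# Hypothetical client-side model of a task budget: a total allowance with
# a live "remaining" figure the model could consult each turn.
class TaskBudget:
    MIN_BUDGET = 20_000  # the article's stated minimum

    def __init__(self, total_tokens: int):
        if total_tokens < self.MIN_BUDGET:
            raise ValueError(f"budget must be >= {self.MIN_BUDGET} tokens")
        self.total = total_tokens
        self.used = 0

    def spend(self, tokens: int) -> None:
        self.used += tokens

    @property
    def remaining(self) -> int:
        return self.total - self.used

    def status_line(self) -> str:
        """What the model would 'see' mid-run to plan its spending."""
        return f"token budget: {self.remaining:,}/{self.total:,} remaining"

budget = TaskBudget(50_000)
budget.spend(12_400)         # tokens consumed by the last turn
print(budget.status_line())  # token budget: 37,600/50,000 remaining
```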
Claude Code adds an /ultrareview command: a dedicated code-review session that runs an in-depth review focused on bug hunting and design issues; Pro and Max users get three free runs per month.
Auto mode opens to Max users: previously Enterprise-only, it now lets Claude make decisions freely and greatly reduces how often it queries the user. Claude Code team lead Boris Cherny put it this way: "Give Claude a mission, let it run, come back and see what's done."
VI. Benchmarks: where it wins, where it loses
The following are the main benchmark results currently available (sources: Anthropic's official system card and partner evaluations).
Programming and engineering (Opus 4.7 leads)

Vision and multimodal (Opus 4.7 leads by a wide margin)

Knowledge work (Opus 4.7 leads)

Comprehensive evaluations (Opus 4.7 clearly a tier above)

General reasoning (all three roughly level)

These benchmarks are saturated and no longer an effective competitive dividing line.

Research tasks (GPT-5.4 leads, Opus 4.7 regresses)

Long context (Opus 4.7 substantially regresses)

Summing up the selection logic: in four areas, programming, engineering agents, vision, and financial/legal expertise, Opus 4.7 has clear advantages; on research-intensive tasks and open web search, GPT-5.4 is stronger; and on long context, Opus 4.7 falls far short of its own predecessor, the most alarming point of all.
VII. Safety guardrails: paving stones for Mythos
This part could easily be dismissed as the release's boilerplate safety statement, but it is the key to understanding Anthropic's current strategy.
On April 7, Anthropic announced Project Glasswing: opening Claude Mythos Preview to nine partners, Apple, Google, Microsoft, Nvidia, Amazon, Cisco, CrowdStrike, JPMorgan Chase, and Broadcom, dedicated to defensive cybersecurity scenarios.
Mythos is Anthropic's most powerful model to date; according to The Hacker News, it can autonomously detect zero-day vulnerabilities and has found thousands of previously unknown flaws in major operating systems and browsers. Precisely because of that capability, it has been judged to carry significant abuse risk and is not publicly available.
Opus 4.7 is the first test sample on this line. At the training stage, Anthropic deliberately reduced the model's offensive cybersecurity capability (while preserving as much defensive capability as possible) and shipped a real-time guard system that automatically detects and intercepts high-risk cybersecurity requests. From the announcement: "We will learn from Opus 4.7's real-world deployment how effective these guardrails are, and then decide whether to extend them to Mythos."
In other words, every developer using Opus 4.7 is helping Anthropic calibrate the safety fence.
Gizmodo's take: the launch adopts a "bold marketing strategy, proactively promoting a new model as deliberately self-limited" and less broadly capable than the alternatives, which is rare for a flagship release.
Security practitioners who need Opus 4.7 for legitimate penetration testing, vulnerability research, or red-team exercises must apply to the Cyber Verification Programme.
VIII. Price and migration: nominally unchanged, effectively higher
Pricing: $5 per million input tokens, $25 per million output tokens, the same as Opus 4.6. The API model ID is claude-opus-4-7. Available platforms include the Claude API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry; GitHub Copilot is also live.
But, as noted above, the tokenizer change makes the same input produce roughly 1.0-1.35x as many tokens, and the higher default thinking-effort tier burns more tokens on top of that. For long tasks and agent workstreams, actual costs under identical settings may be 2-3x what they were on Opus 4.6.
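Stacking the factors shows how "2-3x" is plausible. Only the tokenizer multiplier comes from the article; the effort and cache-miss multipliers below are explicit assumptions for illustration.

```python
# Rough worked example of the cost claim, with assumed values flagged.
TOKENIZER_MULT = 1.35  # upper end of the article's 1.0-1.35x range
EFFORT_MULT = 1.8      # assumed: xhigh vs. the old medium default (not from source)
CACHE_MISS_MULT = 1.2  # assumed: extra reloads from the 5-minute cache TTL (not from source)

combined = TOKENIZER_MULT * EFFORT_MULT * CACHE_MISS_MULT
print(f"combined multiplier: ~{combined:.1f}x")  # ~2.9x, inside the claimed 2-3x
```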
Anthropic also cut Claude Code's cache lifetime from one hour to five minutes. If you step away from the computer for more than five minutes, the context cache expires: it has to be reloaded, and tokens burn faster. The Reddit community is already full of users complaining that quota "burns faster than ever."
The list of breaking changes for existing Opus 4.6 users (a before/after sketch follows the list):
1. The extended thinking budget parameter has been removed and now returns a 400 error; migrate to the advanced thinking mode
2. Sampling parameters such as temperature, top_p, and top_k have been removed; output must now be steered through prompting
3. Stricter literal instruction-following: prompts tuned for Opus 4.6 need retesting; you cannot simply swap in the new model ID in production
4. The tokenizer change alters token counts; run samples against real traffic before fully migrating
5. Default output no longer includes the reasoning-token summary; an explicit visibility setting is required to get it back
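Put together, migration might look like the following sketch. The removed parameters and the claude-opus-4-7 ID come from the list above; the 4.6-era ID is assumed, and the call shape follows the public Anthropic Python SDK.

```python
import anthropic

client = anthropic.Anthropic()

# Opus 4.6 era (parameters the article says now fail with a 400):
# client.messages.create(
#     model="claude-opus-4-6",                              # assumed old ID
#     max_tokens=2048,
#     temperature=0.3,                                      # removed
#     top_p=0.9,                                            # removed
#     thinking={"type": "enabled", "budget_tokens": 8000},  # budget removed
#     messages=[{"role": "user", "content": "..."}],
# )

# Opus 4.7: steer style through the prompt instead of sampling parameters.
msg = client.messages.create(
    model="claude-opus-4-7",  # ID from the announcement
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": "Answer tersely and deterministically: ...",
    }],
)
print(msg.content[0].text)
```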
Practical recommendation: Anthropic's official migration guide suggests running Opus 4.7 against representative production traffic and evaluating token consumption and task quality before making the official switch.
The scariest kind of release: a precision knife
Opus 4.7 is an upgrade with a clear target direction, achieved at a clear cost. All of this is Anthropic's design, and to a large extent you will pay for it.
On the gains side of this model:
• SWE-bench Verified 87.6%, SWE-bench Pro 64.3%, Cursor Bench 70%, 3x the tasks at Rakuten: programming improvements that can be felt in production
• The vision rebuild (XBOW 54.5% to 98.5%, 3x the resolution, 1:1 pixel mapping) pushes computer use past the reliable-deployment threshold for the first time
• xhigh, task budgets, and /ultrareview are a visible response to the trust damage
• BigLaw 90.9%, Finance Agent 64.4%: a clear lead in expertise areas such as finance and law
On the give-up side:
• MRCR v2@1M from 78.3% to 32.2%: long-context ability cut to well under half
• BrowseComp down from 83.7% to 79.3%: search capability now trails both GPT-5.4 and Gemini 3.1 Pro
• Tokenizer change + higher default effort + shorter cache TTL = a triple invisible price increase
• Mythos stays locked away, meaning Anthropic is still holding a bigger card it cannot play
What this release really delivers is neither "the strongest model" nor "the strongest open model" but: a model with a clear trade-off.
The latest news is that Claude Code's annualized revenue reached $2.5 billion in February. Opus 4.7 is the next bet on that line.
Programming and vision go up; long context and search go down; prices stay nominally flat while bills rise. Anthropic is balancing with Opus 4.7: repairing the trust damage left by Opus 4.6 while running a field exercise of the safety guardrails meant for the larger Mythos-class future. More importantly, it needs to press the lead it holds today, converting users' preference for its products into a habit that stays indispensable across a generation of products, defects and all, and then build the kind of love-hate user stickiness Apple enjoys, and a genuinely, commercially valuable ecosystem.
