Claude Opus 4.7: Is it the strongest model?
Opus 4.7 was never meant to be the "best model" at all. It is a release with a clear trade-off, a "precision knife" style of launch.

Original title: "Opus 4.7 doesn't want to be the strongest model: you can't keep up with Anthropic"
Original source: Silicon Starman Pro
On April 16, 2026, Anthropic officially released Claude Opus 4.7, just over two months after the previous-generation Opus 4.6.
After a recent run of intensive, almost frantic product and model updates, a new Anthropic model naturally arrives wrapped in hype. You have probably already seen plenty of first-look write-ups calling Opus 4.7 the "most powerful model," complete with "programmers are finished" and "unemployment warning" headlines.
But look at what Anthropic itself actually published.
The tone of this release is not normal at all.
Anthropic states directly in the announcement that Opus 4.7 is less capable than Claude Mythos Preview, and Mythos is open only to a few partners such as Apple, Google, Microsoft, and Nvidia; it is not available to ordinary developers and users.
More interesting than the rhetoric is that Opus 4.7 is not only weaker than the near-mythical Mythos; it is also weaker than the previous generation's model in some of its key capabilities.
Opus 4.7 posts an unusual number on its own benchmark sheet: MRCR v2@1M falls from Opus 4.6's 78.3% to 32.2%, a sharp decline of 46 percentage points.
Very few flagship models dare to cut one of their signature capabilities by more than half.
And that was a deliberate choice.
So if you keep running on autopilot and hyping every new model as the strongest, you are no longer keeping up with Anthropic's own rhythm.

It does not even bother to dress up this crash.
Opus 4.7 is a release with no intention of being the "most powerful model": a clear trade-off, a "precision knife"-style launch that departs from the playbook of previous frontier-model vendors. It points to a new direction that today's leading vendors will collectively move toward once they clearly feel that "big leaps" in the model itself are no longer sustainable. To some extent, Anthropic is already aligning with the marketing strategies Apple, Microsoft, and others adopted once their products were fully mature.
That is probably where 4.7 really matters.
I. Programming capability: real improvements behind the numbers
The best way to understand these changes is to take a closer look at what the model actually does this time.
Here is the complete picture of this Opus 4.7 release: where it improved, where it regressed, what first-hand developer feedback says, and whether you should migrate.
Official announcement: https://www.anthropic.com/news/claude-opus-4-7
Programming gains were the main axis of this release.

SWE-bench Verified (500 real GitHub issues; the model must write patches that pass the tests): up from Opus 4.6's 80.8% to 87.6%, a gain of nearly 7 percentage points and the top score among currently public models. Against Gemini 3.1 Pro's 80.6%, the gap is significant.
SWE-bench Pro (a harder variant covering complete engineering pipelines in four programming languages): Opus 4.7 jumped from 53.4% to 64.3%, up 11 percentage points. Against GPT-5.4's 57.7% and Gemini 3.1 Pro's 54.2%, Opus 4.7 clearly leads on this benchmark.
Cursor Bench (a field benchmark from Cursor that measures a model's coding-assistance quality in a real IDE environment): Opus 4.6 scored 58%; Opus 4.7 jumped to 70%, up 12 percentage points. Cursor co-founder Michael Truell said in the official announcement: "This is a meaningful leap in capability, with more creative reasoning on hard problems."
Partner evaluations:
• Rakuten: Opus 4.7 solved three times as many production tasks as Opus 4.6, with double-digit gains in code quality and test quality
• Fact: task success rates increased 10-15 percent, and the number of model stalls dropped significantly
• Cognition (the company behind Devin): the model "can work for hours without losing the thread"
• CodeRabbit: recall rate up more than 10%, "slightly faster than GPT-5.4 xhigh"
• Bolt: on a longer application-builder task, Opus 4.7 clearly outperformed 4.6
• Terminal-Bench 2.0: Opus 4.7 solved three tasks that no previous Claude model (or competitor) had managed, one of which required multi-file reasoning across a complex codebase to fix a race condition

These data points all aim in one direction: Opus 4.7 has clearly improved on long-horizon, multi-file programming tasks that demand context consistency. That is exactly where users have complained most over the past two months: jobs abandoned halfway, and models losing the thread the moment multiple files are involved.
II. Vision: the most undervalued improvement of the launch
The visual accuracy benchmark XBOW jumped from 54.5% to 98.5%. That is not an incremental improvement; it is a rebuild-level leap.
Specific spec changes:
• Maximum image resolution increased from about 1.15 million pixels (long edge 1,568 px) to about 3.75 million pixels (long edge 2,576 px), more than triple the previous generation
• Model coordinates and actual pixels now map 1:1; previously a task required manually converting by a scaling factor, and that step disappears (see the sketch after this list)
• CharXiv visual reasoning benchmark: 82.1% without tools, 91.0% with tools
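The 1:1 claim is easiest to appreciate next to the conversion it replaces. Below is a minimal Python sketch of that old bookkeeping; the long-edge caps are the article's figures, while the helper names and example coordinates are illustrative, not any official API.

```python
# Sketch: mapping model-reported coordinates back to original-image pixels.
# Long-edge caps (1,568 px for 4.6, 2,576 px for 4.7) come from the article.

def scale_factor(width: int, height: int, long_edge_cap: int) -> float:
    """Factor the image was shrunk by before the model saw it."""
    return min(1.0, long_edge_cap / max(width, height))

def to_original_px(x: float, y: float, width: int, height: int,
                   long_edge_cap: int) -> tuple[float, float]:
    """Convert model coordinates back into original-image pixels."""
    s = scale_factor(width, height, long_edge_cap)
    return x / s, y / s

# Opus 4.6 era: a 4000x3000 screenshot is shrunk to a 1,568 px long edge,
# so a click target the model reports at (500, 400) must be rescaled
# before you can actually click it:
print(to_original_px(500, 400, 4000, 3000, long_edge_cap=1568))  # ~(1276, 1020)

# Opus 4.7, per the announcement: the cap rises to 2,576 px and coordinates
# map 1:1, so for any image under the cap the scale factor is 1.0 and the
# conversion step becomes a no-op:
print(to_original_px(500, 400, 2048, 1536, long_edge_cap=2576))  # (500.0, 400.0)
```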

Which scenarios does this actually affect?
For product teams, this upgrade could be decisive. Computer use in the Opus 4.6 era was "capable but too risky to ship": the error rate was too high and too unpredictable. A visual accuracy of 98.5% means the feature crosses the threshold for reliable deployment for the first time. Several technical bloggers wrote in their evaluations: "If you shelved a computer-use product plan because of Opus 4.6's high error rate, 4.7 removes that barrier."
First-hand feedback on Reddit (r/ClaudeAI): one user wrote, "The visual improvement is critical. I've done a lot of fiddly side projects trying to get models to improve their output in a visual feedback loop, and it's been frustrating. I'm really curious how 4.7 handles it."
Beyond computer use, the beneficiaries include scanned-document analysis (reading smaller fonts, picking out finer chart details), zoom-in understanding, dashboard-style applications, and complex PDF processing.
A cost issue to watch: higher-resolution images consume more tokens. If your application does not need fine detail, downsample images before uploading.
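If that cost matters for your workload, the downsampling step is a one-liner. A minimal sketch, assuming Pillow is available; the 1,280 px target is an arbitrary example, not a figure from Anthropic:

```python
# Shrink an image before upload so a coarse task doesn't pay for
# full-resolution tokens.
from PIL import Image

def downsample_for_upload(path: str, out_path: str, long_edge: int = 1280) -> None:
    img = Image.open(path)
    w, h = img.size
    s = long_edge / max(w, h)
    if s < 1.0:  # only shrink, never enlarge
        img = img.resize((round(w * s), round(h * s)), Image.LANCZOS)
    img.save(out_path)

downsample_for_upload("dashboard.png", "dashboard_small.png")
```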

III. The biggest regression: long context has collapsed
MRCR v2@1M (million-token context memory test):
• Opus 4.6: 78.3%
• Opus 4.7: 32.2%
A collapse of 46 percentage points, from nearly 80% to roughly a third.
This drop has almost no precedent in flagship-model history. MRCR v2 was a capability Anthropic itself highlighted in the Opus 4.6 era, calling it "a qualitative change in the context scale at which a model actually works." With 4.7, that "qualitative change" simply disappeared.
Why? The tokenizer changed.
Opus 4.7 ships with a new tokenizer: the same input text now produces roughly 1.0-1.35x as many tokens, with the multiplier varying by content type.
The direct chain reaction (worked numbers follow the list):
• The 200K/1M context windows are nominally unchanged, but the same text fills more of them
• Actual token consumption rises by roughly 35% on long tasks and agent workflows
• Pricing is unchanged ($5 input, $25 output per million tokens), but actual usage costs go up
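A back-of-envelope calculation makes the squeeze concrete. The window sizes, prices, and the 1.0-1.35x multiplier are the article's numbers; the rest is arithmetic.

```python
# How much 4.6-era text still fits in the nominal windows, and what a
# long task's input bill looks like, under the new tokenizer.
INPUT_PRICE = 5 / 1_000_000  # $ per input token, per the article

for window in (200_000, 1_000_000):
    for mult in (1.0, 1.35):
        effective = window / mult  # old-tokenizer-equivalent capacity
        print(f"{window:>9,}-token window at {mult}x -> "
              f"~{effective:,.0f} tokens of 4.6-era text")

# A task that consumed 1M input tokens on Opus 4.6 now needs ~1.35M:
print(f"input cost: ${1_000_000 * INPUT_PRICE:.2f} -> "
      f"${1_350_000 * INPUT_PRICE:.2f} (+35%)")
```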
Anthropic's official line is that the new tokenizer "improves text-processing efficiency," but the benchmark data show a marked regression on long context.
Search capability is also down:
• BrowseComp (deep web research): down from Opus 4.6's 83.7% to 79.3%
• GPT-5.4 Pro scores 89.3% and Gemini 3.1 Pro 85.9%; Opus 4.7 currently sits at the bottom of the main competitive set
Search and long documents are the most common scenarios for many business users.
First-hand developer feedback on Hacker News (275 points, 215 comments, source: HN discussion):
"to turn off the offensive thinking and pull effort manually to the top to get me back to the baseline. "Our internal assessment looks good" is not enough, and everyone sees the same problem. ""4.7 Default no longer contains human readable reasoning token digests in output, which must be returned by requesting Riga display: returned."
These are issues reported by actual users. But they are also choices Anthropic made deliberately.
IV. New behaviour: self-verification and more literal instruction-following
The official Opus 4.7 announcement contains one statement worth dwelling on: the model verifies its output before reporting results.
Hex's technical team gave a concrete case from testing: when data was missing, Opus 4.7 reported that the data did not exist rather than producing an answer that looked plausible but was fabricated; the latter is exactly the pit Opus 4.6 fell into. Fintech platform Block added: "It can catch its own logical errors at the planning stage, which speeds up execution and outpaces older Claude models."
But self-verification brings an associated behavioural change: Opus 4.7 interprets instructions more literally.
This is a real migration risk. Prompts carefully tuned for Opus 4.6 will probably not be "read between the lines" the way 4.6 did; 4.7 will do strictly what you wrote. Anthropic explicitly calls this out in the official migration guide and recommends regression-testing key prompts before going live on 4.7; a minimal harness of that shape is sketched below.
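As a sketch only: replay a fixed prompt set against both model IDs and count passes. The call shape follows the public Anthropic Python SDK; the 4.6-era model ID and the check_output assertion are stand-ins you would replace with your own.

```python
# Minimal prompt-regression harness of the kind the migration guide suggests.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
PROMPTS = ["Summarize this changelog: ...", "Refactor the attached function: ..."]

def check_output(text: str) -> bool:
    """Stand-in for your task-specific assertions."""
    return len(text) > 0

# "claude-opus-4-6" is an assumed old ID; "claude-opus-4-7" is from the article.
for model in ("claude-opus-4-6", "claude-opus-4-7"):
    passed = 0
    for prompt in PROMPTS:
        msg = client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        passed += check_output(msg.content[0].text)
    print(f"{model}: {passed}/{len(PROMPTS)} prompts passed")
```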
A useful reference figure from Hex's CTO: Opus 4.7 at low effort performs roughly on par with Opus 4.6 at medium effort.
V. Reasoning control: xhigh, task budgets, and /ultrareview
Opus 4.6 had an incident that damaged user trust: on February 9 the default model was switched to adaptive thinking, and on March 3 Claude Code's default reasoning depth was officially lowered from the top tier to medium on the grounds of "balancing intelligence, latency, and cost." Users dubbed the episode "the Deceptive Gate," and a pointed question from a senior GitHub director was widely circulated.
Opus 4.7 responds by putting control over reasoning depth more visibly in users' hands.
xhigh effort: a new reasoning-strength tier between the original high and max. Claude Code has now moved all of its planning defaults to xhigh.
But the developer community questions xhigh directly. In one Reddit user's words: "Opus 4.6 defaulted to medium, and 4.7 defaults to xhigh. I'd like to know what's behind this decision, because obviously a higher effort tier means more token consumption."
In other words, users got a "return control to the user" fix, but the default tier was quietly raised, meaning the same task is now set to burn more tokens. Add the tokenizer change and it is a double cost increase.
Task budgets: a token-budget control mechanism for long tasks. The developer sets a total token budget (minimum 20K); the model can see the remaining balance in real time during execution and allocate resources accordingly, avoiding both mid-task stops from token overruns and unnecessary compute waste. A rough client-side model of the bookkeeping follows.
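The announcement as relayed here describes the mechanism but not the API surface, so treat this as a hypothetical illustration of the accounting, not Anthropic's interface.

```python
# Hypothetical client-side model of a task budget: a total allowance with
# a live "remaining" figure the model could consult each turn.
class TaskBudget:
    MIN_BUDGET = 20_000  # the article's stated minimum

    def __init__(self, total_tokens: int):
        if total_tokens < self.MIN_BUDGET:
            raise ValueError(f"budget must be >= {self.MIN_BUDGET} tokens")
        self.total = total_tokens
        self.used = 0

    def spend(self, tokens: int) -> None:
        self.used += tokens

    @property
    def remaining(self) -> int:
        return self.total - self.used

    def status_line(self) -> str:
        """What the model would 'see' mid-run to plan its spending."""
        return f"token budget: {self.remaining:,}/{self.total:,} remaining"

budget = TaskBudget(50_000)
budget.spend(12_400)         # tokens consumed by the last turn
print(budget.status_line())  # token budget: 37,600/50,000 remaining
```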
Claude Code adds an /ultrareview command: a dedicated code-review session that runs an in-depth review focused on bug hunting and design issues; Pro and Max users get three free runs per month.
Auto mode opens to Max users: previously Enterprise-only, it now lets Claude make decisions freely and greatly reduces how often it queries the user. Claude Code team lead Boris Cherny put it this way: "Give Claude a mission, let it run, come back and see what's done."
VI. Benchmarks: where it wins, where it loses
The following are the main benchmark results currently available (sources: Anthropic's official system card and partner evaluations).
Programming and engineering (Opus 4.7 leads)

Vision and multimodal (Opus 4.7 leads by a wide margin)

Knowledge work (Opus 4.7 leads)

Comprehensive evaluations (Opus 4.7 clearly a tier above)

General reasoning (all three roughly level)

These benchmarks are saturated and no longer an effective competitive dividing line.

Research tasks (GPT-5.4 leads, Opus 4.7 regresses)

Long context (Opus 4.7 substantially regresses)

Summing up the selection logic: in four areas, programming, engineering agents, vision, and financial/legal expertise, Opus 4.7 has clear advantages; on research-intensive tasks and open web search, GPT-5.4 is stronger; and on long context, Opus 4.7 falls far short of its own predecessor, the most alarming point of all.
VII. Safety guardrails: paving stones for Mythos
This part could easily be dismissed as the release's boilerplate safety statement, but it is the key to understanding Anthropic's current strategy.
On April 7, Anthropic announced Project Glasswing: opening Claude Mythos Preview to nine partners, Apple, Google, Microsoft, Nvidia, Amazon, Cisco, CrowdStrike, JPMorgan Chase, and Broadcom, dedicated to defensive cybersecurity scenarios.
Mythos is Anthropic's most powerful model to date; according to The Hacker News, it can autonomously detect zero-day vulnerabilities and has found thousands of previously unknown flaws in major operating systems and browsers. Precisely because of that capability, it has been judged to carry significant abuse risk and is not publicly available.
Opus 4.7 is the first test sample on this line. At the training stage, Anthropic deliberately reduced the model's offensive cybersecurity capability (while preserving as much defensive capability as possible) and shipped a real-time guard system that automatically detects and intercepts high-risk cybersecurity requests. From the announcement: "We will learn from Opus 4.7's real-world deployment how effective these guardrails are, and then decide whether to extend them to Mythos."
In other words, every developer using Opus 4.7 is helping Anthropic calibrate the safety fence.
Gizmodo's take: the launch adopts a "bold marketing strategy, proactively promoting a new model as deliberately self-limited" and less broadly capable than the alternatives, which is rare for a flagship release.
Security practitioners who need Opus 4.7 for legitimate penetration testing, vulnerability research, or red-team exercises must apply to the Cyber Verification Programme.
VIII. Price and migration: nominally unchanged, effectively higher
Pricing: $5 per million input tokens, $25 per million output tokens, the same as Opus 4.6. The API model ID is claude-opus-4-7. Available platforms include the Claude API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry; GitHub Copilot is also live.
But, as noted above, the tokenizer change makes the same input produce roughly 1.0-1.35x as many tokens, and the higher default thinking-effort tier burns more tokens on top of that. For long tasks and agent workstreams, actual costs under identical settings may be 2-3x what they were on Opus 4.6.
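Stacking the factors shows how "2-3x" is plausible. Only the tokenizer multiplier comes from the article; the effort and cache-miss multipliers below are explicit assumptions for illustration.

```python
# Rough worked example of the cost claim, with assumed values flagged.
TOKENIZER_MULT = 1.35  # upper end of the article's 1.0-1.35x range
EFFORT_MULT = 1.8      # assumed: xhigh vs. the old medium default (not from source)
CACHE_MISS_MULT = 1.2  # assumed: extra reloads from the 5-minute cache TTL (not from source)

combined = TOKENIZER_MULT * EFFORT_MULT * CACHE_MISS_MULT
print(f"combined multiplier: ~{combined:.1f}x")  # ~2.9x, inside the claimed 2-3x
```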
Anthropic also cut Claude Code's cache lifetime from one hour to five minutes. If you step away from the computer for more than five minutes, the context cache expires: it has to be reloaded, and tokens burn faster. The Reddit community is already full of users complaining that quota "burns faster than ever."
The list of breaking changes for existing Opus 4.6 users (a before/after sketch follows the list):
1. The extended thinking budget parameter has been removed and now returns a 400 error; migrate to the advanced thinking mode
2. Sampling parameters such as temperature, top_p, and top_k have been removed; output must now be steered through prompting
3. Stricter literal instruction-following: prompts tuned for Opus 4.6 need retesting; you cannot simply swap in the new model ID in production
4. The tokenizer change alters token counts; run samples against real traffic before fully migrating
5. Default output no longer includes the reasoning-token summary; an explicit visibility setting is required to get it back
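Put together, migration might look like the following sketch. The removed parameters and the claude-opus-4-7 ID come from the list above; the 4.6-era ID is assumed, and the call shape follows the public Anthropic Python SDK.

```python
import anthropic

client = anthropic.Anthropic()

# Opus 4.6 era (parameters the article says now fail with a 400):
# client.messages.create(
#     model="claude-opus-4-6",                              # assumed old ID
#     max_tokens=2048,
#     temperature=0.3,                                      # removed
#     top_p=0.9,                                            # removed
#     thinking={"type": "enabled", "budget_tokens": 8000},  # budget removed
#     messages=[{"role": "user", "content": "..."}],
# )

# Opus 4.7: steer style through the prompt instead of sampling parameters.
msg = client.messages.create(
    model="claude-opus-4-7",  # ID from the announcement
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": "Answer tersely and deterministically: ...",
    }],
)
print(msg.content[0].text)
```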
Practical recommendation: Anthropic's official migration guide suggests running Opus 4.7 against representative production traffic and evaluating token consumption and task quality before making the official switch.
The scariest kind of release: a precision knife
Opus 4.7 is an upgrade with a clear target direction, achieved at a clear cost. All of this is Anthropic's design, and to a large extent you will pay for it.
On the gains side of this model:
• SWE-bench Verified 87.6%, SWE-bench Pro 64.3%, Cursor Bench 70%, 3x the tasks at Rakuten: programming improvements that can be felt in production
• The vision rebuild (XBOW 54.5% to 98.5%, 3x the resolution, 1:1 pixel mapping) pushes computer use past the reliable-deployment threshold for the first time
• xhigh, task budgets, and /ultrareview are a visible response to the trust damage
• BigLaw 90.9%, Finance Agent 64.4%: a clear lead in expertise areas such as finance and law
On the give-up side:
• MRCR v2@1M from 78.3% to 32.2%: long-context ability cut to well under half
• BrowseComp down from 83.7% to 79.3%: search capability now trails both GPT-5.4 and Gemini 3.1 Pro
• Tokenizer change + higher default effort + shorter cache TTL = a triple invisible price increase
• Mythos stays locked away, meaning Anthropic is still holding a bigger card it cannot play
What this release really delivers is neither "the strongest model" nor "the strongest open model" but: a model with a clear trade-off.
The latest news is that Claude Code's annualized revenue reached $2.5 billion in February. Opus 4.7 is the next bet on that line.
Programming and vision go up; long context and search go down; prices stay nominally flat while bills rise. Anthropic is balancing with Opus 4.7: repairing the trust damage left by Opus 4.6 while running a field exercise of the safety guardrails meant for the larger Mythos-class future. More importantly, it needs to press the lead it holds today, converting users' preference for its products into a habit that stays indispensable across a generation of products, defects and all, and then build the kind of love-hate user stickiness Apple enjoys, and a genuinely, commercially valuable ecosystem.
