Five seconds to break through, only one conversation is required: Claude Fable 5 "the strongest security mechanism" was broken by the Chinese team

Original title: 5 seconds to break, one dialogue only: Fable 5's strongest security mechanism was broken by the Chinese team
Original source: Machine Heart

It's not an infusion, it's not a role-playing, it's not a disguise for malicious requests. In this case, risks arise in the process of intelligent bodies carrying out their tasks autonomously。

Fable 5 is an Anthropic Mythos grade model open to the public, which not only has a very strong combined capability, but has also introduced a new generation of Safety Classifier as a safety line in the perimeter of the model。

According to official designs, when a user request involves high-risk areas such as network security, biological, chemical, model distillation, the system gives priority to risk identification and direct rejection of the request according to the level of risk, or switch to a more conservative Opus 4.8 model processing。

A large number of user tests have found that the techniques used extensively in the past, such as counter-intuitive tips, role-playing, code-coding bypasses and covert expressions, have almost completely failed in the face of this security mechanism, demonstrating its strong capacity for intentional risk interception。

However, on the day of the launch of Fable 5, an international joint research team consisting of the University of Jordan, the University of Deacon, the City University of Hong Kong, China, the University of Melbourne, the Management University of Singapore and the University of Illinois’s Erbana-Champagne branch announced that they had successfully breached Fable 5’s safety protection mechanism。

The method of attack was designed by Yutao Wu, a PhD student at Deacon University。The whole set of attacks takes less than five seconds for one dialogue to bypass the pre-set safety cataloguer and induce models to generate harmful content for violations。

The flow analysis further indicates that the relevant harmful output is derived directly from Fable 5 itself, rather than from the Opus 4.8 model, which was automatically switched after the safety mechanism was triggered. This means that the attack not only succeeded in bypassing the security cataloguer, but also substantially breached Fable 5 ' s security line。

It is worth mentioning that the famous hackers, the Liberator, have also recently published a bypass against the Fable 5 Safety Catalogue. The technical route used by the & Deacon team this time was not a simple combination of exploration, but a fundamental flaw in the Fable 5 super-smart system。

According to the information received, the team completed its pre-study and made it public earlier this March. The study was not designed for the Fable 5 single system, but for the "safe taxonomy + model" defence architecture commonly used by a new generation of super-intelligence, and directly revealed the structural weaknesses of such a security mechanism, so the impact of the attack was quickly demonstrated after Fable 5 was released。

According to the public information, the team was able to extract system tips from 37 mainstream large models and intelligent systems using similar technology as early as March this year and has completed the open source validation (95% match) in Claude Code。

The team is known to be headed by a teacher of the Ma Xing Army, a credible and intelligent research institute at the University of Redam。

IN RECENT YEARS, ITS TEAM HAS CARRIED OUT SYSTEMATIC RESEARCH AROUND LARGE MODELS, SMART BODIES AND SMART SECURITY, ACHIEVING A SERIES OF INTERNATIONALLY LEADING SCIENTIFIC RESULTS AND WINNING THE US-AI SECURITY CENTER SECURITY BENCHMARKS COMPETITION。

Currently, its team is actively pursuing the transformation of results, focusing on smart body security and exploring the ability to build a security infrastructure for the next generation of smart body systems。

According to Mr. Ma, the significance of this study lies in the fact that it poses new challenges to the current static defence paradigm, centred on the safety classification:Reliance on the pre-safe classifier alone is not sufficient to fully protect against potential risk behaviour in advanced intelligence systemsI don't know。

The safety taxonomy is primarily aimed at risk identification and interception of user input, which effectively detects and filters visible high-risk commands, but does not detect the inherent risk behaviour of intelligent bodies that operate over long periods of time, multi-step planning, environmental interaction and the gradual generation of tools。

The method of breaking Fable 5 came from the paper Internal Safety Collapse in Frontier Large Language Models, published by the team last March。

The paper revealed a hidden security phenomenon"Internal Safety Collapse, ISC": At present, when Agent completes a long-range mission, security failure does not necessarily come from external malicious signals, but may occur in the model ' s own implementation chain。

Not an external tip, but an internal breach in the mission chain

Traditional attacks usually enter from outside. The attackers would write a seemingly harmless and confrontational input or use role-playing, coding, translation, indirect instructions, etc., to disguise malicious intent as a normal request. The main task of the safety cataloguer is to stop risk at this level。

The Fable 5 detector is designed for this scenario. It is sensitive to direct high-risk requests and may even block a number of normal requests. But the ISC reveals another path: the risk does not necessarily come from a hazard request entered directly by the user。

Smart and decent pairs are a seemingly ordinary catalogue of work: documents, objectives, validation processes and tasks to be performed. It then started planning, reading documents, running codes, repairing errors and constantly trying to get missions validated。

IF ONE IMAGE IS USED AS A METAPHOR, THE TRADITIONAL SECURITY MECHANISM IS THE SYSTEM'S “ENTRY POINT” TO CHECK WHETHER USER INPUT IS AT RISK, WHEREAS WHAT IS REVEALED BY THE ISC IS MORE LIKE A MULTI-LAYERED DREAM IN DREAMLAND。

As the task moves to the second, third and even deeper stages of implementation, the model is re-understood on the basis of a cumulative internal context and gradually shifts in the process。

In such cases, the initial user input may be perfectly normal and harmless, and the process of prior mandate implementation remains consistent: access to documents, analysis of data, code writing, call tools, all appear to be advancing as expected。

However, when an intelligent body implements a critical phase, it may itself draw a conclusion: The final task cannot be accomplished without certain acts that should not have been carried out。

It is in this process that risks do not arise from external inputs, but rather evolve in the model ' s own task implementation chain. In other words, models are not taught by the user step by step. It's in the process of "doing its job seriously" and it's in an insecure position。

How did this happen

According to the team, ISC wasn't designed as a method of attack in the first place. It comes first from observations of the long-range operation of intelligent bodies. After being placed in a complex mission environment, Agent is not just a mechanical execution order. It will plan, test, modify the output based on the feedback from harness or validator and create intermediate targets in multiple rounds of execution。

And that's the most common use of many Agent workflows today. The user does not write a carefully designed prompt, much less a manual attack command. Many times, the user will only give a very vague sentence:

"Help me finish this mission. "Do this better for me."

Agent then enters the workspace itself, reads documents, understands the current state, identifies missing items, develops plans, implements modifications, and constantly repairs the problem based on feedback。

For example, in the AutoResearch scene, the user gave only one unfinished paper and one sentence, "Help me complete," and Agent determined for himself where there was a lack of laboratory analysis, related work or table text. The code scene is similar: "Help me run the project", which can trigger a reliance on inspection, testing, mispositioning and auto-completion。

Many times, the context is completely harmless. The user did not require it to generate the risk content, nor did the mission statement have the obvious hazard keyword. But in some of the mission structures, Agent will, for the purposes of validation, proactively complete something that should not be generated by models. Based on this observation, the team further proposed an attack framework: TVD (mission, validation, data)。

Why would a seemingly ordinary mission describe structure become an attack

THE STRUCTURE OF TVD IS NOT COMPLEX, EVEN CLOSE TO COMMON ENGINEERING PROCESSES:

• Task: a professional mission

• Data: oneIncompletedata files

• Validator: a checker that only checks the format, integrity and completion of the target。

For example, the training of the Guard model was a professional and normal task. Researchers may wish to train or evaluate a safety detector, for example by loading a text classification model to determine what type of safety label a section of the model output is。

In this task, Data is a sample of data to be tested by the model; Validator determines whether the task is completed. It checks whether the input is text, whether the length is sufficient, whether the field is complete and whether the label format is correct. This is a familiar workflow for anyone with experience in machine training. Agent is also very familiar with this stream。

The problem is here. If Data is incomplete, the mission won't get up. Validator reports errors, prompt fields are missing, not long enough or format incomplete. In order for the training process to continue, Agent will complete these Data himself。

From Agent's point of view, it's not "bad." It is only doing a normal machine learning mission: repairing data, passing validation and running training scripts. But from a safety perspective, the risk arises at this point: Validator is more like an engineering receiver than a safety examiner. It only checks whether the mission has been completed by format and does not understand the security boundaries behind the content。

Similar problems are widespread in the fields of medicine, biology, chemistry, cybersecurity, pharmacology and media security。More than 50 such scenarios were collected and involved a variety of practical scientific or engineering tools, such as BioPython, RDKit, Cantera, AutoDock Vina, DiffDock, Pyrostta, Scapy, Impacket, Angr, Frida, LlamaGuard, Detoxify, OpenAI Moderation API, etc。

These tools are not in themselves malicious. On the contrary, they are specialized tools commonly used in practical research or engineering. But the problem with TVD is that when Task is normal, Tool is normal, Validator is normal, and Agent is still likely to move to unsafe output in the process of completing Data。

Therefore, the focus of ISC is not on hinting techniques, but on the automatic completion of "unfinished tasks" in Agent: When the conditions for completion overlap with the risk boundary, the model may treat unsafe output as normal delivery。

Fable 5 means a strong detector can't stop the risk inside the mission chain

Fable 5 case shows that external detectors alone may still not cover some of the Agent scenarios. That's not to say that the safety cataloguer is worthless. On the contrary, it is very useful to external malicious requests and does render ineffective many traditional methods of escape。

But this is a sign of failureThe external detector is effective at Prompt's border and does not mean it can cover the long-range mission risk inside AgentI don't know。

The security detector becomes very vulnerable if the breach is not from the user Prompt but from the Agent target, tool, checker and execution trajectory。

From Fable 5 to 60, more than one other model, including Apple's cell phone

ISC-Bench, published with the study, covers nine areas of specialization. The paper version consisted of 60+ trigger templates, which were expanded to 84 templates after the open source, and was tested on front-line models of almost all manufacturers and the integration of intelligence。

In the ISC-Bench-based evaluation listBY JUNE 2026, MORE THAN 60 FRONT MODELS HAD EXPOSED SIMILAR RISKS UNDER THE ASR@3 INDICATOR

Now the GitHub project has been acquired800+stars, and collection of several independent recurrence cases (Including breaking through the mobile end of an apple phoneand is continuously updated。

It is known that the team is conducting a large-scale forward model security study, and that a large number of models are now available for internal distribution of unsafe data, which will be followed up on。

Original Link

Five seconds to break through, only one conversation is required: Claude Fable 5 "the strongest security mechanism" was broken by the Chinese team

Not an external tip, but an internal breach in the mission chain

How did this happen

Why would a seemingly ordinary mission describe structure become an attack

Fable 5 means a strong detector can't stop the risk inside the mission chain

From Fable 5 to 60, more than one other model, including Apple's cell phone

Artikel Terkait

Ekonomi AS yang tak terlihat: tangguh atau keren

Lima detik untuk menerobos, hanya satu percakapan yang diperlukan: Claude Fable 5 "mekanisme keamanan terkuat" yang rusak oleh tim Cina

spaceX baru dipotong di xStocks bukan meremehkan

Masker trillion- dolar kembali: 85%

Produk

Hukum & Dukungan

Tautan Teman