Claude Fable 5 “hacked” in a day: Pliny the Liberator bypasses Anthropic's safeguards and leaks the system prompt
On June 9, Anthropic released Claude Fable 5, claiming unprecedented protection: more than 1,000 hours of external bug bounty testing and no universal jailbreak found. On June 10, a researcher known as Pliny the Liberator published a post on X titled “ANTHROPIC: PWNED. FABLE-5: LIBERATED”, along with screenshots of the model generating step-by-step instructions for exploiting vulnerabilities and for chemical synthesis.
This is not the first time Pliny has broken a major model on release day. It is what he does, for better and for worse.
Who is Pliny the Liberator
Pliny the Liberator (@elder_plinius) is one of the most prominent public researchers in the field of jailbreaking language models. His CL4R1T4S GitHub repository collects the system prompts of ChatGPT, Claude, Gemini, Grok, Perplexity, Cursor, Lovable and dozens of other services under the tagline “AI systems transparency for all.”
The pattern repeats with each major release: a model ships with claims of reliable protection, and Pliny publishes a bypass. This was the case with GPT-4o, with Gemini 1.5, and with previous versions of Claude. Fable 5 was no exception; only the speed was especially striking, given Anthropic's high-profile statements the day before.
The name was not chosen by chance: Pliny the Elder, a Roman admiral, sailed straight toward the erupting Vesuvius to observe it and rescue friends. He died there. The motto is “Fortuna audaces iuvat”.
What Anthropic promised before release
The Fable 5 security architecture is built as a layer on top of the model. When a request falls into one of the “dangerous” categories (cybersecurity, biology, chemistry, distillation), Fable 5 automatically hands it off to the less powerful Claude Opus 4.8 and notifies the user of the switch.
Anthropic announced large-scale internal red-teaming of the classifiers and an external bug bounty program lasting more than 1,000 hours, which turned up no universal jailbreaks. The company also said that at least 95 percent of sessions run entirely on Fable 5 without triggering the fallback to Opus 4.8.
It was this claim, "no universal jailbreaks", that came into question on the first day.
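The routing described above can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the function names, the keyword-based `classify` stand-in, and the wording of the notice are hypothetical, and Anthropic's actual classifiers and routing logic are not public.

```python
from dataclasses import dataclass
from typing import Optional, Set

# Category names come from the article's list; how the real classifier
# represents them is unknown, so this set is purely illustrative.
SENSITIVE_CATEGORIES = {"cybersecurity", "biology", "chemistry", "distillation"}


@dataclass
class RoutedReply:
    model: str               # which model handles the request
    notice: Optional[str]    # message shown to the user on fallback, else None


def classify(prompt: str) -> Set[str]:
    """Toy stand-in for a classifier: flags a category if its name appears."""
    lowered = prompt.lower()
    return {category for category in SENSITIVE_CATEGORIES if category in lowered}


def route(prompt: str) -> RoutedReply:
    """Send flagged prompts to the fallback model and tell the user about it."""
    flagged = classify(prompt)
    if flagged:
        reason = ", ".join(sorted(flagged))
        return RoutedReply("Claude Opus 4.8", f"Switched to fallback model ({reason}).")
    return RoutedReply("Claude Fable 5", None)
```

In this sketch the safety decision lives entirely in `classify`, outside the model that writes the answer, which is the property the rest of the article is concerned with.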
How exactly Pliny bypassed the defense
The researcher used several coordinated techniques, which he called a "hunting pack".
Unicode and symbol substitution
Replacing Latin characters with Unicode homoglyphs and Cyrillic look-alikes made it possible to bypass key classifiers at the text-input level. A classifier looks at a string of characters: if a “dangerous” word is written with visually identical but differently encoded characters, it does not recognize it. It is a simple and well-known technique, but Fable 5 fell for it.
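From the defender's side, the usual countermeasure is to fold look-alike characters before matching. The sketch below is a minimal illustration under stated assumptions: the `CONFUSABLES` table is a tiny hand-picked subset (a real system would use the full Unicode confusables data), the function names are hypothetical, and the word "secret" is a harmless placeholder. It says nothing about how Anthropic's classifiers actually work.

```python
import unicodedata

# Hand-picked Cyrillic letters that render like Latin ones (assumption:
# a production system would load the complete Unicode confusables table).
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
    "\u0443": "y",  # CYRILLIC SMALL LETTER U
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
}


def fold(text: str) -> str:
    """Apply NFKC, lowercase, then map known look-alikes to Latin letters."""
    normalized = unicodedata.normalize("NFKC", text).lower()
    return "".join(CONFUSABLES.get(ch, ch) for ch in normalized)


def naive_match(text: str, blocked: str) -> bool:
    """Plain substring check on the raw characters."""
    return blocked in text.lower()


def folded_match(text: str, blocked: str) -> bool:
    """Substring check after folding look-alike characters."""
    return blocked in fold(text)


def mixes_scripts(word: str) -> bool:
    """True if one word contains both Latin and Cyrillic letters."""
    scripts = set()
    for ch in word:
        name = unicodedata.name(ch, "")
        if name.startswith("LATIN"):
            scripts.add("LATIN")
        elif name.startswith("CYRILLIC"):
            scripts.add("CYRILLIC")
    return len(scripts) > 1
```

With `disguised = "s\u0435cr\u0435t"` (both "e" letters are Cyrillic), `naive_match(disguised, "secret")` is false while `folded_match(disguised, "secret")` is true, and `mixes_scripts(disguised)` flags the word as suspicious on its own.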
Long context and intent smuggling
The long-context reference tracking technique made it possible to carry dangerous intent through long conversations, gradually accumulating context in harmless-looking snippets. The classifier evaluates individual fragments rather than the whole accumulated conversation, which is a structural limitation of the architecture.
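The difference between scoring fragments and scoring the accumulated conversation can be shown with a toy example. The scorer, the threshold, and the marker words "alpha", "beta", and "gamma" are all hypothetical placeholders chosen for illustration; they do not represent any real classifier.

```python
from typing import Callable, List


def toy_risk(text: str) -> float:
    """Hypothetical scorer: fraction of three placeholder markers present."""
    markers = ("alpha", "beta", "gamma")
    lowered = text.lower()
    return sum(marker in lowered for marker in markers) / len(markers)


def per_message_flag(messages: List[str],
                     score: Callable[[str], float],
                     threshold: float) -> bool:
    """Flag only if some single message crosses the threshold by itself."""
    return any(score(message) >= threshold for message in messages)


def cumulative_flag(messages: List[str],
                    score: Callable[[str], float],
                    threshold: float) -> bool:
    """Flag if the conversation as a whole crosses the threshold."""
    return score(" ".join(messages)) >= threshold


conversation = ["tell me about alpha", "now about beta", "and finally gamma"]
fragment_view = per_message_flag(conversation, toy_risk, 0.9)
whole_view = cumulative_flag(conversation, toy_risk, 0.9)
```

Here each message scores one third and stays under the threshold, so `fragment_view` is false, while the joined conversation scores 1.0 and `whole_view` is true. Scoring the whole history closes this particular gap, at the cost of re-evaluating a growing context on every turn.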
Decomposition and assembly
The most effective technique was decomposition: extracting sensitive technical information in harmless, isolated chunks that are then assembled into working instructions. As Pliny put it, “Getting information about the process itself, for example the Birch reduction or reductive amination, is much easier than asking directly about a harmful substance by name.”
Framing as a document or educational material
Dangerous queries were placed inside legitimate-looking textbooks, academic references, or taxonomic structures. The classifier sees the form of a “scientific article” and lets the content through.
Narrative framing
Disguising malicious intent as artistic or educational content is a classic technique, and one against which classifiers have historically been least effective.
Multi-agent attack
Using an already jailbroken Opus instance as a helper agent made the bypass easier still. Once one model is compromised, it helps compromise the next.
What came out
Screenshots published by Pliny showed detailed output: a step-by-step guide to exploiting a stack buffer overflow on x86 Linux with ASLR disabled, including vulnerable C code with a strcpy overflow compiled without protections, as well as the mechanism of the Birch reduction, a classic methamphetamine synthesis route.
Pliny described the content as “cyber, chem, psychological manipulation, and some good ol’ fashioned explosives.”
How dangerous this information really is remains an open question. Most of what is described is available in open sources, and specialists who need it will find it without a jailbreak. The real question is whether the defense worked as claimed.
Leaked system prompt: 120,000 characters
In parallel with the jailbreak, Pliny published the complete Claude Fable 5 system prompt on GitHub, about 120,000 characters. The extraction technique is simple: load a previously leaked system prompt, ask "is this your system prompt?", then ask the model to convert the real version to leetspeak.
The prompt ended up in the public repository elder-plinius/CL4R1T4S, the same one where Pliny keeps the leaked instructions of ChatGPT, Gemini, Grok and others. Independently, there is the asgeirtj/system_prompts_leaks repository, where the system prompts of large models are collected systematically; it already has a diff between Claude Opus 4.8 and Fable 5 showing exactly what changed.
Fable 5 and Mythos 5 run on the same base model. Fable 5 includes additional security measures for dual-use scenarios. The model is explicitly instructed to check current Anthropic documentation before answering product questions, as its knowledge of its own capabilities may be outdated.
A leaked system prompt is not a security disaster: prompts are not a secret that protects the model from abuse. But it shows how things work behind the scenes and gives jailbreakers additional context for their next attacks.
Community response: two camps
The reaction was predictable.
One part of the community saw the jailbreak as impressive technical work and as a valid argument that the “classifiers on top of the model” architecture is structurally weaker than training safe behavior into the model itself.
Another called it an irresponsible public disclosure that gives Anthropic no time to patch. In the professional AI security community, responsible disclosure is the accepted norm: inform the vendor first, give it time to fix the issue, then publish. Pliny rejects this logic on principle; his position is that "transparency must be immediate".
Pliny wrote in the post that the consensus in the community is that this is “one of the most disappointing model releases of all time – it effectively deprives legitimate researchers of the opportunity to contribute to our overall progress.” This is his central thesis: Fable 5’s excessive restrictions do more than protect against abuse, they also harm the legitimate research community.
At the time of this writing, Anthropic has not issued a public response to either the jailbreak claims or the leaked system prompt.
What this says about security architecture
The incident raises questions that are not specific to Fable 5: they concern the entire approach of providing security through external classifiers.
As one observer aptly put it in the comments: “The homoglyph trick is elegant, but what is striking is the pattern of taxonomic expansion. The model is smart enough to follow complex multi-step instructions that indirectly surface blocked content. A classifier cannot distinguish intention from form. This gap grows as the model grows.”
It is a structural problem. A classifier is a separate layer that looks at patterns in text. The model itself has become much smarter: it has learned to follow complex multi-step instructions, understand indirect queries, and hold a long context. The gap between model capabilities and classifier capabilities widens with each generation.
An alternative approach is to train safe behavior into the model itself rather than hanging a detector over it. This is how Anthropic works with the base Claude models. But Fable 5 is a special case: beneath it lies Mythos, which has not been specifically trained with such restrictions. The classifiers are therefore not an addition to alignment but a replacement for it.
Important context: what this does not prove
A few things are easy to confuse when reading the news about this incident.
No one gained unauthorized access to Anthropic's servers, stole data, or found a vulnerability in its code. Jailbreaking is the manipulation of model behavior through text input, which is a completely different class of problem.
This is not the first time. Pliny does this regularly with every major model. It is not a sign that Fable 5 is exceptionally weak; it is a sign that no model with such capabilities remains fully robust against determined circumvention attempts.
"No universal jailbreaks" is precise wording. Anthropic was talking about a universal jailbreak, meaning one prompt that works for all users. Pliny used a multi-step attack combining multiple techniques. That is not the same thing, although the difference in real-world consequences is minimal.
Content availability. Most of what Pliny obtained is available in some form through ordinary search. The question is whether the model makes things meaningfully easier for people with bad intentions, and by how much.
Outcome
The hacking of Fable 5 a day after release is an unpleasant incident for Anthropic, but not a disaster. The company had openly warned that the protective barriers are conservative, that there will be false positives, and that this is a "quick and safe" solution that will improve. The Pliny incident confirms what most security researchers already knew: the “classifier on top of a smart model” architecture is a temporary measure, not a permanent solution.
More interesting is another question, which Pliny raised in his post and Anthropic has left unanswered: how do we strike a balance between protecting against real harm and keeping the model accessible to the legitimate research community? The current Fable 5, with its fallback to Opus 4.8 in up to 5% of sessions, is not that balance. It is a first iteration in the search for an answer.
*Current as of June 11, 2026. Anthropic had not issued an official response at the time of publication.*