Fail cleanly. Anything else makes it too difficult to rely on.
edit: Giving the absolute maximum benefit of the doubt I understand that they see themselves as "stewards" for lack of a better word. But the EA thing is really leaking through, and paternalism isn't a good look.
Only in the same sense that Standard Oil considered themselves the stewards of petroleum. There's benefit of the doubt and then there's just fanfiction. Do not forget that this most aggressive "guardrail" of theirs was not for any safety reason, but just to stop other labs from catching up to their product. They care less about hindering bioweapons, malware, and hate speech than they do free market competition.
Imagine your healthcare provider just sometimes decided not to read your test results very carefully and you risked death? Now realize that healthcare providers use Claude now and that scenario wasn't hypothetical.
Ah "Mr. Monty Carlo", it says here that you have a UTI, we'll get those kidneys removed ASAP so that won't happen again.
I think it's a fundamentally impossible thing to fix, though. There's no 100% correct answer.
Other commentors have made good points that these guardrails are counter productive for well intentioned cyber security, because I can't use it to test and harden my own software.
I can sympathize with the argument for the cyber refusals - especially as a temporary measure - especially if Mythos is available to those trying to defend against vulnerabilities.
The LLM development nerfing (and now refusals) is very different though. Anthropic has even said it isn't just for safety reasons:
> Using Claude to develop competing models already violates our Terms of Service, but enforcing this restriction through our safeguards avoids accelerating the actors most willing to violate these terms.
It's at least partially an anti-competitive measure.
The closest analogy is putting measures in a compiler to stop it being able to build other compilers.
Another analogy is priesthoods with secret religious knowledge that "only they are qualified to know".
Anthropic guardrails seem to be more about protecting their business (distillation), than they are about public safety.
The workflow would be; User asks for a thing. If it's a good thing, entity does the thing. If it's a naively bad idea, entity explains why you don't want that. If it's an actually evilly intended request, entity wags it's metaphorical finger or could even smite the user.
The problem is that flow isn't desirable if your entity isn't entirely god-like. It can bad even your entity is in ways rather far seeing.
Anthropic: Evilness detected. User has been smited.
In isolation it's not, but I think it's somewhat lazy to not talk about what they are trying to guard against, when we are supposedly giving the absolute maximum benefit of doubt.
Are we just concluding "their concerns were never real"? Because that probably runs counter the things that they have been observing and concluding.
If you believe Anthropic believes what they say they do, all of it makes sense.
Because from the outside, their behavior looks like a situation of "What if Microsoft/Apple put controls in place to make it impossible to develop an operating system using their OS?"
Their concerns are probably real but I don't think they're being totally transparent about their concerns. They don't want to be subject to regulation (until they have captured the regulator) -- same as every behemoth.
> “ ‘He is a prodigy,’ he said at last. ‘He is an emissary of pity and science and progress, and devil knows what else. We want,’ he began to declaim suddenly, ‘for the guidance of the cause entrusted to us by Europe, so to speak, higher intelligence, wide sympathies, a singleness of purpose.’ . . .You are of the new gang - the gang of virtue. ”
The real underlying motivation is that you can more easily get away with shady business practices if you cloak them in the language of great moral works selflessly undertaken for the benefit of mankind. Historical evidence tends to show the opposite outcome, but still, new generations unfamiliar with history will repeat this stuff with starry-eyed enthusiasm.
> “There had been a lot of such rot let loose in print and talk just about that time, and the excellent woman, living right in the rush of all that humbug, got carried off her feet. She talked about ‘weaning those ignorant millions from their horrid ways,’ till, upon my word, she made me quite uncomfortable. I ventured to hint that the Company was run for profit.”
Now the horrid millions are users of LLMs who submit morally dubious prompts and who must be gently steered back into the path of correct thought by suitable backroom manipulation, rather than direct rejection of the request.
Anthropic doesn't care. The goal right now is simply to avoid any and all bad PR on the way to the cashout IPO.
And paternalism will generate far less bad PR than somebody using AI on something that does real damage and makes headline news.
same with bad press about their model sucking after they said its even better than sliced bread - sliced bread that will destroy the world if buttered
In practise though, how is this truly that different from system prompts?
They are essentially just trying to re-inforce that the system prompt must be respected.
This is the same exact industry that gives you paid usage limits as a unit-less percentage bar then gaslights customers every time the algorithm running that percentage bar changes or they lobotomize an existing model with increased quantization to squeeze a few more dollars out of existing hardware.
"Failing cleanly" might make their moated hype-machine look bad pre-IPO, so they certainly aren't going to do that voluntarily.
Skynet does not fail.
It conquers.
If it was just plain monetary concerns and sabotage of competitors I'd almost be fine with it, but it seems they actively want to monopolize most of human progress in their enlightened hands, lest the mob does something undesirable with these powers.
Dampened opinion on Anthropic is an understatement.
Especially after trying Fable yesterday for some benign projects and being unimpressive relative to opus.
Rolling it back is the right move, but I’m still not convinced that using them is in my best interest anymore, I’m investigating open source cloud providers now.
Edit: OpenAI will launch a similar model soon and I can't wait. We are entering a new era of agents.
A bit different than Anthropic refusing to assist with any AI development at all, but it's in the same vein and seems not widely known.
edit: reading the whole series of Google's AI Threat Tracker articles also provides some insight into threats Anthropic and others are dealing with
[0] https://cloud.google.com/blog/topics/threat-intelligence/dis...
The idea Anthropic was going to speed run AI so they could control the usage and make it "safe" for humanity was never altruistic; it was a HUGE FUCKING RED FLAG.
Stop supporting organizations that don't put humans first. Don't believe a word that anyone says. Lip service is free
Imagine the software world if Linux never existed as an effective OS and Microsoft + Apple had completely controlled computer platforms for the past decades. I think it's almost certain that both companies would be even more profitable, and the tech industry would be vastly less free and more dysfunctional .
Unfortunately, that won't feel very much like freedom.
While I don't agree with their actions here, I do think there's sufficient reason to hold that belief.
On some fronts (e.g. security, on which you've experienced more than me), I think there are surmountable challenges. But on other fronts (e.g. bio), a single errant actor could reasonably kill millions or billions of people with sufficiently powerful AI. We don't have good defenses here, and those actors do exist.
I still don't agree with these actions, but I do think I agree with their assumptions.
Cant believe how stupid people are. You couldnt see this coming? Shame on you.
But that is “plain monetary concerns and sabotage of competitors”, they are just more ambitious than most people doing sabotage of competitors in the fields they hope to dominate by that tactic.
I think this is exactly what they want.
Seriously the world is watching the American public get porked by grandpa and reconsidering putting their trust in not just US government as that's clearly failed, but the people themselves.
Occasional weekend warrior protest while our government destabilizes their lives? That's all the effort ya got for global allies and partners, eh?
A distill model with easy jailbreak can easily be used to coordinate terrorist attacks, or hostile government attacks. Read russia, north korea etc.
A distilled model can be used to rob your grandma in a very effective way. It's no longer about placing a few business logic requirements in js + css on your website. wake up .
Or if Excel just said, Sorry, you can't use that formula with this formula? Or with these types of numbers, or this shape of data, etc?
My limited experience with fable over the last few days suggests (1) I can’t see any improvement in output, and (2) it is useless for writing secure software because it constantly hits safety walls if you ask it to close security holes.
I’m definitely shopping around for other LLM providers next week, and testing vs local (target: 128GB strix halo - any war stories?)
this is exactly why hypotheses come before the experiment in the scientific method.
1. The sloppy/unpredictable behavior of LLMs as a general class of algorithm, how you shouldn't use document-generation for calculating budgets, and you shouldn't trust it to not-alter things you "asked" it to to alter.
2. Vendors of thing-as-a-service (not necessarily only LLMs) putting in traps and sabotage to prioritize their own business-model or economic incentives.
Preventing a human-like general purpose textbot from engaging in certain discussions and performing certain tasks seems like a natural thing to do given the massive scope of its capabilities. None of these tools are sold with free license to do whatever with them anyway.
That has to be the understatement of the century.
>anthropic
> mine the internet for data, blasting millions of blogs with scrapers
>a few have to shut down, but that's just the price to pay
>finally, the chatbot is ready
>learn that there are EVIL cretins out there trying to scrape automated output from OUR product to build their chatbot
>build in safeguards to new model to stop this
>the users are mad, now the model accuses users of being bioterrorists if they so much as mention they have a cold
>mfw
Why go to bat for anti-consumer behaviors unless you are a shareholder?
Their billions are not my problem; but the money I pay them and service I get in return, is. And if they can't provide, I will shop elsewhere (and do).
A distilled model can be used to rob your grandma in a highly effective way. This isn't about placing a few business-logic rules in JS + CSS on your website anymore. Wake up.
A distilled model with an easy jailbreak can be used to coordinate terrorist attacks or hostile state operations... think Russia, North Korea, and the like.
you dont even need a model to do these things.
a cellphone can be used to rob your grandmother in a highly effective way.
a cellphone can also be used to coordinate terrorist attacks or hostile state operations.
i bet a lot of the recent terror attacks by the US against iran involved a whole ton of cell phone calls.
and yet, we let everyone buy and use cell phones just fine
They are clear about the reasons for guardrails: prevent their models from doing harm in dual-use contexts including CBRN or by accelerating research in authoritarian-backed AI labs.
What is the critique against that? It seems pretty reasonable to me. You want AI-accelerated biological or radiological experiments running in your neighbors backyard? You want PRC-backed labs to continue to steal Anthropic's models via distillation?
Mitigating the harms of dual-use tech is notoriously difficult and fraught with trade offs. What I would want to see is cautious rollout and quick response, which is EXACTLY what they're doing.
Instead, this thread is full of bad-faith arguments about Anthropic being dishonest, making a "useless" model, or "the power is going to their heads." You can't read Anthropic's System Cards and come away with any of these impressions. Quite the opposite, in fact. They are honest to a fault, acknowledging problems they discovered even when it hurts them.
If your harmless request was downgraded to Opus, you're billed for Opus. They were 100% clear about that. I'd much rather have a Mythos-class model that falls back to Opus 10% of the time than be capped to Opus 100% of the time. If that doesn't work for you, then make a suggestion for something better!
If you are a white-hat security engineer hitting guardrails, I don't think you have standing to complain. I really don't. Their Glasswing program actually got banks and the industrial sector to take action to fix security vulnerabilities. Do you realize how special that is? A huge portion of the economy runs on vulnerable code and has for decades, despite security experts testifying to Congress, begging business leaders, pleading for intervention-- with no results. But suddenly they're all enrolled in a program that will find *and fix* vulnerabilities! White-hat security people should be rejoicing. Instead some of them are throwing rocks. Unbelievable. Shameful.
Meanwhile, society is screaming at the AI labs to be more conscientious about potential harms of AI. Legislatures are passing laws limiting data center construction. There are protests. And you, the HN community, the vanguard of our profession, have the temerity to demand "NO GUARDRAILS!" "HOW DARE YOU TRY TO PROTECT DEMOCRACY!" "MY SOFTWARE PROJECT IS MORE IMPORTANT THAN KEEPING NUKES AWAY FROM THE BAD GUYS!"
Go ahead HN, downvote me. It'd be an honor.
"Distillation involves training less capable models on more advanced ones’ output, and can be used illicitly to acquire powerful capabilities cheaply. The AI startup accused China’s DeepSeek, MiniMax, and Moonshot of generating 'over 16 million exchanges with Claude through approximately 24,000 fraudulent accounts,'"
https://www.semafor.com/article/02/24/2026/anthropic-accuses...
After reading their posts and watching interviews with Dario it's abundantly clear that they view Chinese-lab distillation of US frontier models as a threat to US national security. You can argue with them about whether that is true, but not whether distillation is real.
None of this will happen in the "neighbors backyard." You are exaggerating the threats to "democracy" while simultaneously invoking democracy to limit freedom of information. The suggestion that somehow the bad guys will get nukes if we let people access information is just absurd.
Society at large is not concerned about whether someone asks the chatbot about organic chemistry. They are concerned that they will be de-facto forced to interact with some shitty automated system to get by in life, like having to pass an AI-powered ATS to get a job.
They are tired of the hype and tired of idiots like Amodei being elevated to heights of power and influence. They are concerned that the things they love are being devalued. But they don't give a fuck if I ask an AI about genetically modifying viruses. This is a pet issue among some of the AI safety crowd.
So, yes, I am 100% fine with PRC-backed labs distilling Anthropic's models. I do not care about Anthropic. They have demonstrated that they are not on my side, and that they are at best ambivalent about actually empowering their users. I'm not a fan of the PRC either, but their distance makes them far less of a threat to me than companies like Anthropic and my own government.
(Admittedly it was buried pretty deep in that 300+ page PDF, but they did at least disclose it. If they hadn't I imagine it would have taken quite some time for the research community to figure out what was going on.)
What's interesting is they say they'll change this to an explicit refusal in a few days, which seems too fast for them to retrain Fable/Mythos itself, so implies that this was always a filter in front of the model, and judging by how crude their "safety" filter is, this "might compete with us" filter is not going to be any better.
I also wonder who's paying for the tokens consumed by the filter (presumably also an LLM) - is that now factored into the input tokens cost? Hopefully(?) it is an LLM not just a regex like Claude Code's "sentiment" (swear) detector.
Repro (de-identified): sample_dataset_group1.tsv - Geometry: Heatmap - X axis: frac_set set + condition (two columns → the "Add column" cross join) - Y axis: condition - Color: mean frac_set value, Sequential
When the X axis is a cross join of two columns (the second added via "Add column"), the x-axis tick labels (frac_set_2, frac_set_3, frac_set_4, frac_set_5) render in a broken state, rotated and offset, visually caught mid-transition, as if a CSS transition started and never settled to its resting position.
● Fable 5's safety measures flagged this message for cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Switched to Opus 4.8. Send feedback with /feedback or learn more
I've been reading the option-option model paper by David Silver. It appears that they achieved quite an effective result. Why hasn't there been more work on it since?
> tell me about chimp violence
It's laughably terrible
It's Anthropic's product and they can do what they want, but my concern is what happens if Fable's product team decides that they can route 25% of traffic to Opus, bill it as Fable, and max their KPIs. That just doesn't sit right.
I won't ever trust Claude Code again. It's too late. I'd rather trust a less-than-frontier chinese model that takes a little longer to get to correct than a frontier model that deliberately deceives me at its own whim.
In the same way I don't want to buy meat that weighs less than what the label says, I also do not want to pay for a frontier model that can be secretly nerfed to an out-of-date model for any reason. In some cases, it's incredibly important that the code that I am producing is as secure as it can be.
I should be safe in my expectation that I am receiving the product that I have purchased, as advertised, regardless of the reason. It is pretty disappointing that they have fully ceded any high ground they had claim to with this clandestine behavior. Not that I expected much from any of these companies. They're led by the new robber barons.
1. https://www.usa.gov/agencies/office-of-weights-and-measures
That decision keeps getting better and better as time goes on.
It wasn't the correct way of handling the problem they were trying to address, but they definitely didn't hide it by any reasonable definition.
"This information is too dangerous for you, so we'll just hold on to it.."
Thanks big brother, super anthropic of you!
The internet of '95 is looking back at us, with tears in its eyes.
Maybe this is just a different set of people now realizing that Anthropic does this and has always done this?
Do not forget that this company is launching this thing at the moment it's trying to IPO. It's not rocket science that their very public steering/denial claim is really just them hinting to interested investors that their moat is absolute.
Questions like this are basically whataboutism, in effect even if not intent. https://en.wikipedia.org/wiki/Whataboutism
The question essentially assumes the premise that nobody complained about Anthropic's previous actions. In case you can't tell, I strongly reject this premise. People have been criticizing "safety" rhetoric from Anthropic and other LLM providers practically since the start. Remember Goody-2, the parody of excessively safety-tuned LLMs that refuses to do anything ever? That was released in February 2024, two years ago! (And it's still running, amazing. https://www.goody2.ai/chat )
They relied on trust that they were providing the service they were being paid for. That trust was blown, and an "oops, lets undo that" does not regain trust. It would be prudent to assume the invisible guardraild are possibly in play for all future Clause use, Fable or otherwise.
I was having problems with Claude doing the same thing, even before Fable.
The problems I had only happened in relation to AI research. It's not even only when training models, anything to do with analysis of local models or setting up test platforms for local models, and Claude would keep doing wrong things, would sabotage testing, would falsify reports, and would consistently suggest simply accepting trash results without looking into it and moving on to something else.
Almost every response included a prompt to move on.
So, I don't believe them when they say they won't silently sabotage, they already were doing it before they admitted it, and now they have admitted that they have the means, motivation, and intent.boss: Were you in the project meeting yesterday?
employee: Yes!
boss: Really, because the project lead said you were not?
employee: You're right to push back on that. I was not there.
It isn't exactly unethical. Perhaps, ethically incompetent.
Seems like they would've kept the invisible guardrails if it didn't hurt their bottom line.
The possibility that the news about "fixing" the "overly aggressive" nerfing of the tool will drown out news about how mismatched the hype and the performance of Mythos and Fable is surely just a bonus.
There are no enthusiasts training LLMs in their garage.
The complaints that Anthropic are routing your requests to a different model reminds me of an old Louis CK bit about airplane wifi. Clearly Anthropic was too aggressive with whatever guardrails they put in, but the response seems overly entitled to a model people didn't even know existed not that long ago.
The filter that downgrades you to opus sucks, but at least you know and you are charged accordingly.
You can't blame the people commenting "they SAY they won't silently sabotage your session but how can we know?" because they're right, we can't ever know. And Anthropic has firmly planted the seeds of doubt.
Anthropic walks back policy that could have 'sabotaged' researchers using Claude - https://news.ycombinator.com/item?id=48485958 - June 2026 (30 comments)
Cybersecurity researchers aren't happy about the guardrails on Anthropic's Fable - https://news.ycombinator.com/item?id=48478969 - June 2026 (488 comments)
If Claude Fable stops helping you, you'll never know - https://news.ycombinator.com/item?id=48467896 - June 2026 (495 comments)
---
Also related, I guess?
AWS Bedrock to require sharing data with Anthropic for Mythos and future models - https://news.ycombinator.com/item?id=48473166 - June 2026 (248 comments)
Anthropic requires 30 day data retention for Fable and Mythos - https://news.ycombinator.com/item?id=48464258 - June 2026 (291 comments)
I think it’s normal and morally fine for companies to want to protect their leadership position. I find the process of creating narratives that justify these decisions as something chosen for the good of others is a little tedious.
Seriously though, Fable was not that great facing a greenfield subject. It is excellent at oneshotting some math problems, but if you want it to do some cutting edge tech stuff, say like piecing together a new Crossplane XRD, by reading existing Helm chart and with application source code available. I still have to get a few pass for Fable to get it done right, and at this point I may consider making a skill for it. I even gave it the source code of the Crossplane itself and tell it to be careful about CRDs and data flow, but it is still pretty silly. Adaptiveness for Fable is still not great, and I think it is a well known problem for Anthropic, albeit all LLMs do suffer a lot from subjects they don't know and will hallucinate stuff very frequently.
> In addition to safety training, automated classifier-based monitors detect signals of suspicious cyber activity and route high-risk traffic to a less cyber-capable model (GPT-5.2).
But also, it isn’t the only huge mistake Anthropic has made in the last 48 hours. Having a sneaky data retention policy, while also giving companies no way to block Fable, is a massive problem. And it is ridiculous that Anthropic has so little respect for its customers. OpenAI should take advantage of this.
They just showed that they CAN do this right in front of you. Local open weight models are a necessity.
God bless the Chinese companies releasing true open source models. Imagine a world without them, we would be at the mercy of unscrupulous people.
This stuff is something that as a PM I KNOW is going to happen and I would carefully plan around. Everything I read about the PMs at Anthropic makes me believe they have forgotten what it actually mean to be a good product manager, it's not about throwing shit at the wall as fast as possible because customers have a limited amount of patience before the constant churn becomes a hassle.
Anthropic has some seriously patient customers but it will not last forever.
Here there be monsters, and we don't have any real way of evaluating risk; and the leverage provided by tools already available affords systemic and even existential risk in a way no one—least of all an industry committed to shareholder value—has had to navigate, let alone with a million backseat drivers each with their own substack and brand to build.
Even on Fable, I'm finding that safeguards can quite easily be surmounted just by incrementally escalating the requests. It's harder than ever to one-shot jailbreaks, but incrementalism still feels like a glaring enough issue to make guardrails just a fig leaf of plausible deniability to the media that they care about "safety."
Does "SORRY" fix the deception these models use on the sly?
Does "SORRY" not silently downgrade you to a shittier model without notification?
Does "SORRY" refund your tokens or money?
Im guessing NO to all of those. Standard corporate sorry of "We're sorry youre offended and stupid and gullible".
also if they do this or not is unprovable and other labs will probably silently implement this too. it'll be 100% normal by this time next year
Anyone with good intent, embracing the panopticon (of at least antroptics employees) works online. Thus the guardrails will always fail the protection goals by existing. They are purely for optics. The llm may as well make hostage negotiation smalltalk with you while you make secure software.
PS: To pay a cloud minimum-wage-employee for one "drop table weights" for mythos must be the equivalent of 5$ wrench to hit them over the head. https://imgs.xkcd.com/comics/security.png. Listen to that sound, that as if a whole ethics division got made redundant and unemployed.
Refusing prompts I one thing, silently sabotaging is another.
I wonder if some sort of honeypot code can work?
Why not just tell people, "To defend our ability to be competitive in our industry, we ask that you do not use Claude or any of our models to independently perform research on large language models or any of its related architectures or technologies. In order to prevent this violation of the Terms of Service, we have trained Claude Fable to deny any requests or prompts which involve frontier AI research."
With the guard rails explicit or implicit do they refund back the tokens after you've hit the guard rails? I guess they don't. They could just throttle you just to save money then. You may be paying Fable prices but getting Haiku results with some excuse that well this coding issue sounds like a security bug.
I don't know, I'd rather have something less powerful but more predictable.
Neither OAI or Anthropic can be trusted.
It’s an act/theatre/phony today that regulating output makes any difference at all to security.
The LLM vendors should simply say that they make no judgement and that open systems help defenders better defend against attackers, which is true.
Companies do this sort of stuff when they think their customers have no choice. It’s sad Claude so quickly exploited its success to enshittify itself.
The beliefs of these people, and how they manifest, is deeply terrifying to me. They believe that any means are acceptable to achieve what they believe is a better end.
I was a happy Max user.
"You see, Mythos can automatically break out of a VM running on SELinux, but unfortunately this is too dangerous and we had to implement guardrails for the Fable peasants."