Anthropic apologizes for invisible Claude Fable guardrails

(theverge.com)

511 points · rarisma · 12d ago · 445 comments

https://web.archive.org/web/20260611122253/https://www.theve..., https://archive.ph/y4V4k

200 comments · 80 top-level

Avicebron12d ago· 37 in thread

I like Claude Code a lot, but I think it sets a dangerous precedent to put guardrails in that return a response from a prompt that was modified by the system in real time in order to subvert the original intent.

Fail cleanly. Anything else makes it too difficult to rely on.

edit: Giving the absolute maximum benefit of the doubt I understand that they see themselves as "stewards" for lack of a better word. But the EA thing is really leaking through, and paternalism isn't a good look.

Paracompact12d ago

> Giving the absolute maximum benefit of the doubt I understand that they see themselves as "stewards" for lack of a better word.

Only in the same sense that Standard Oil considered themselves the stewards of petroleum. There's benefit of the doubt and then there's just fanfiction. Do not forget that this most aggressive "guardrail" of theirs was not for any safety reason, but just to stop other labs from catching up to their product. They care less about hindering bioweapons, malware, and hate speech than they do free market competition.

keeganpoppen11d ago

this reads like "throw everything at the wall and see what sticks" reactionary-ism... i'm guessing that it's not particularly easy to use claude to help you make bioweapons, and we all know that they have neutered Fable vis à vis security research because people have already been complaining about it. and the funny thing about hate speech is that there is absolutely no need for ai-- it tends to come out the best when spoken directly "from the heart", as it were, anyway.

ryeights11d ago

Superintelligent AI is more dangerous than a bioweapon. How, then, is this guardrail not addressing the most pertinent safety concern of all?

cnd78A11d ago

Mixing up bioweapons and malware with hate speech (policing which is basically censorship) shows how very basic people like Trump can win. Hopefully you won't wait to be censored before realizing that anything could be interpreted as "hate speech".

mapontosevenths12d ago

I agree 100%. Doing a worse job IS an error. It should be treated as such. Or at the very least make that behavior opt-in. The default should not be pretending like nothing happened and just quietly doing a worse job.

Imagine your healthcare provider just sometimes decided not to read your test results very carefully and you risked death? Now realize that healthcare providers use Claude now and that scenario wasn't hypothetical.

largbae12d ago

Especially if your name has any machine learning terms in it.

Ah "Mr. Monty Carlo", it says here that you have a UTI, we'll get those kidneys removed ASAP so that won't happen again.

ceejayoz11d ago

Yes, but as with spam/phishing/abuse prevention, too much information about what does and doesn't trigger things can be very useful to attackers. An explicit error is something you can feed into another AI to find jailbreaks.

I think it's a fundamentally impossible thing to fix, though. There's no 100% correct answer.

bs728012d ago

I think the reasonable middle ground Anthropic is trying to achieve is: let the organizations that make the most important and critical software get a head start on cybersecurity before they inevitably allow everyone else the same access.

Other commenters have made good points that these guardrails are counterproductive for well-intentioned cybersecurity work, because I can't use it to test and harden my own software.

nl11d ago

I think it's a big mistake to conflate the cyber (and bio) refusals with the LLM development refusals.

I can sympathize with the argument for the cyber refusals - especially as a temporary measure - especially if Mythos is available to those trying to defend against vulnerabilities.

The LLM development nerfing (and now refusals) is very different though. Anthropic has even said it isn't just for safety reasons:

> Using Claude to develop competing models already violates our Terms of Service, but enforcing this restriction through our safeguards avoids accelerating the actors most willing to violate these terms.

It's at least partially an anti-competitive measure.

The closest analogy is putting measures in a compiler to stop it being able to build other compilers.

Another analogy is priesthoods with secret religious knowledge that "only they are qualified to know".

sciencejerk12d ago

Claude Opus 4.6 and 4.8 find vulns in source code just fine and 4.6 will pentest without source for you given a proper harness WITHOUT jailbreaking. WITH jailbreaks, you can probably imagine what they are capable of.

Anthropic guardrails seem to be more about protecting their business (distillation), than they are about public safety.

ryandrake12d ago

I wonder who gets to decide which companies make important and critical software and which ones get the scraps later.

whywhywhywhy11d ago

The security guardrails are one thing but they extended it to AI work unrelated to security too to protect their lead.

pseudohadamard11d ago

I see it more as a lose/lose: Any malicious user/attacker will just bypass the guardrails using one of a million established techniques for doing so while legit developers and security researchers will be prevented from finding problems by them.

wouldbecouldbe12d ago

I asked it to analyse my architecture and find any security issues and it did it perfectly, first identified the issues & then fixed them. Not sure why my prompt managed to get through the guardrails

notrealyme12312d ago

Exactly. For cybersecurity the failure was visible. It was not visible for "Frontier" ML research. The argument about a head start in IT security is not feasible here.

thefounder11d ago

There is no middle ground on shadow bans while taking your hard-earned cash. It is fraud, a Nigerian scam.

joe_the_user12d ago

The problem is that Anthropic seems to be working up to the workflow one would naively want from AGI/some-god-like-entity.

The workflow would be: User asks for a thing. If it's a good thing, entity does the thing. If it's a naively bad idea, entity explains why you don't want that. If it's an actually evilly intended request, entity wags its metaphorical finger or could even smite the user.

The problem is that flow isn't desirable if your entity isn't entirely god-like. It can be bad even if your entity is in some ways rather far-seeing.

dantillberg12d ago

User: Is it possible there is more than one true god? Could there ever be any competition for Anthropic's AI?

Anthropic: Evilness detected. User has been smited.

jstummbillig12d ago

> paternalism isn't a good look.

In isolation it's not, but I think it's somewhat lazy to not talk about what they are trying to guard against, when we are supposedly giving the absolute maximum benefit of doubt.

Are we just concluding "their concerns were never real"? Because that probably runs counter to the things that they have been observing and concluding.

estearum12d ago

Basically all critiques of Anthropic's policy moves on these topics boil down to people not believing the fundamental concerns are real, and often then going a step further to conclude that Anthropic doesn't actually believe their concerns either.

If you believe Anthropic believes what they say they do, all of it makes sense.

thewebguyd12d ago

Then what is it they are trying to guard against, if it's not simply protecting their moat ahead of their IPO?

Because from the outside, their behavior looks like a situation of "What if Microsoft/Apple put controls in place to make it impossible to develop an operating system using their OS?"

dpkirchner12d ago

> Are we just concluding "their concerns were never real"?

Their concerns are probably real but I don't think they're being totally transparent about their concerns. They don't want to be subject to regulation (until they have captured the regulator) -- same as every behemoth.

esafak12d ago

We've all been observing it. The recent spate of cyberexploits was powered by AI.

colordrops12d ago

You are arguing with a straw man. Most are saying they should be explicit with the failure modes rather than fail silently. They aren't saying there should be no guardrails.

hootz12d ago

What is "EA" in this context? I see a lot of people using this initialism.

photochemsyn12d ago

It’s rewarmed rhetoric from the late 19th/early 20th century, most effectively pilloried by Joseph Conrad in “Heart of Darkness” in the character of Mr. Kurtz:

> “ ‘He is a prodigy,’ he said at last. ‘He is an emissary of pity and science and progress, and devil knows what else. We want,’ he began to declaim suddenly, ‘for the guidance of the cause entrusted to us by Europe, so to speak, higher intelligence, wide sympathies, a singleness of purpose.’ . . .You are of the new gang - the gang of virtue. ”

The real underlying motivation is that you can more easily get away with shady business practices if you cloak them in the language of great moral works selflessly undertaken for the benefit of mankind. Historical evidence tends to show the opposite outcome, but still, new generations unfamiliar with history will repeat this stuff with starry-eyed enthusiasm.

> “There had been a lot of such rot let loose in print and talk just about that time, and the excellent woman, living right in the rush of all that humbug, got carried off her feet. She talked about ‘weaning those ignorant millions from their horrid ways,’ till, upon my word, she made me quite uncomfortable. I ventured to hint that the Company was run for profit.”

Now the horrid millions are users of LLMs who submit morally dubious prompts and who must be gently steered back into the path of correct thought by suitable backroom manipulation, rather than direct rejection of the request.

massagedpelican12d ago

Effective altruism. A lot of the folks working on AI at large tech companies are disproportionately represented in the movement. There's a lot of overlap between EA and the rationalist community as well. The wikipedia page is a good place to start https://en.wikipedia.org/wiki/Effective_altruism

carlgreene12d ago

Effective Altruism I think

jcgrillo12d ago

"crypto bros" to a first approximation

bsder11d ago

> paternalism isn't a good look.

Anthropic doesn't care. The goal right now is simply to avoid any and all bad PR on the way to the cashout IPO.

And paternalism will generate far less bad PR than somebody using AI on something that does real damage and makes headline news.

8note11d ago

people cancelling their subscriptions doesn't look great either

same with bad press about their model sucking after they said it's even better than sliced bread - sliced bread that will destroy the world if buttered

tacone12d ago

That also means people are paying money to execute a prompt they've (partially) written.

SomeUserName43211d ago

> I think it sets a dangerous precedent to put guardrails in that return a response from a prompt that was modified by the system in real time

In practise though, how is this truly that different from system prompts?

They are essentially just trying to re-inforce that the system prompt must be respected.

thinkingtoilet12d ago

Was it modifying the prompt? I thought it only kicked the request down to 4.8.

cvadict12d ago

> Fail cleanly.

This is the same exact industry that gives you paid usage limits as a unit-less percentage bar then gaslights customers every time the algorithm running that percentage bar changes or they lobotomize an existing model with increased quantization to squeeze a few more dollars out of existing hardware.

"Failing cleanly" might make their moated hype-machine look bad pre-IPO, so they certainly aren't going to do that voluntarily.

fragmede11d ago

The "look", of course, is completely bullshit. Release the model, give licensing terms, sue the ever-living daylights out of anyone who's hosting it without agreeing to those terms, and move on. This vertical integration shit that we're all enamored with is bullshit. Even Amazon has their own vans instead of UPS being their own thing? No wonder stepmom porn is on the rise.

shevy-java11d ago

> Fail cleanly.

Skynet does not fail.

It conquers.

Sol-12d ago· 30 in thread

This has dampened my opinion on Anthropic quite a bit. It's difficult to take their marketing for AI as an empowering technology seriously when they are quite clear in their new deployments that they do not mean empowering for you, but empowering for them and organizations that are in their (or the US government's, despite Anthropic's performative disagreements with the administration) good graces. You are allowed to vibe code some dashboards, a web app or let it drive Excel, but anything more interesting than that is forbidden.

If it was just plain monetary concerns and sabotage of competitors I'd almost be fine with it, but it seems they actively want to monopolize most of human progress in their enlightened hands, lest the mob does something undesirable with these powers.

thewebguyd12d ago

Don't forget their push for full regulatory capture in the name of "safety" as well so they can pull the ladder up behind them before anyone else has an equally capable model and releases it without the anti-competitive safeguards, while also pushing to completely ban open weight models, or any model trained on a certain level of compute without "rigorous" government testing and validation (which I'm sure, they'll conveniently provide the framework for).

Dampened opinion on Anthropic is an understatement.

reactordev12d ago

They are the only ones I’ve contacted my bank to get a charge back on…

californical12d ago

Yeah, I cancelled my Claude subscription yesterday after learning about their attitude of intentionally sabotaging their paying customers.

Especially after trying Fable yesterday for some benign projects and finding it unimpressive relative to Opus.

Rolling it back is the right move, but I’m still not convinced that using them is in my best interest anymore. I’m investigating open source cloud providers now.

solenoid093712d ago

Opus is nowhere close to Fable. Fable feels at least one generation ahead to me. https://x.com/hyperagentapp/status/2064396004032463157

Edit: OpenAI will launch a similar model soon and I can't wait. We are entering a new era of agents.

varenc12d ago

Google has been doing the same thing for longer than Anthropic[0]. To protect their models from distillation attacks, they silently will downgrade the model's performance to essentially poison your training data without your knowledge.

A bit different than Anthropic refusing to assist with any AI development at all, but it's in the same vein and seems not widely known.

edit: reading the whole series of Google's AI Threat Tracker articles also provides some insight into threats Anthropic and others are dealing with

[0] https://cloud.google.com/blog/topics/threat-intelligence/dis...

chiwilliams12d ago

Thanks for flagging this. This is interesting

m3kw911d ago

It's a 2 horse race, and google is not one of them right now.

Rapzid12d ago

"Only I can save us". It's a classic tragedy and cautionary tale.

The idea Anthropic was going to speed run AI so they could control the usage and make it "safe" for humanity was never altruistic; it was a HUGE FUCKING RED FLAG.

m3kw911d ago

And their huge "red lines"

DANmode11d ago

Benevolent dictators work.

But, looking to a US corp to be one?

That’s daft.

vlan012d ago

Corporations cannot help but act this way. They are too big. The pressure for profit is all that matters. That is the priority. It doesn't matter what colorful words they put on the paper to make you feel better. Look at the "green" movement 20 years ago. All talk and no action.

Stop supporting organizations that don't put humans first. Don't believe a word that anyone says. Lip service is free.

rurp12d ago

Yeah I'd say this has been a big concern ever since it turned out immensely expensive training methods could create effective frontier models. So far at least, open source models have kept up better than I expected, but they definitely lag the top ones and there's no guarantee the gap doesn't widen further.

Imagine the software world if Linux never existed as an effective OS and Microsoft + Apple had completely controlled computer platforms for the past decades. I think it's almost certain that both companies would be even more profitable, and the tech industry would be vastly less free and more dysfunctional.

tlb12d ago

Yes, that is basically the plan. It's based on the belief that unfettered AI would let anyone be a supervillain and destroy the world. There are enough would-be supervillains out there, but they rarely get far because they can't get teams of smart people to build doomsday machines for them. So the AI has to not let anyone do evil with it.

Unfortunately, that won't feel very much like freedom.

lebovic12d ago

It sounds like you might not agree with that belief.

While I don't agree with their actions here, I do think there's sufficient reason to hold that belief.

On some fronts (e.g. security, on which you've experienced more than me), I think there are surmountable challenges. But on other fronts (e.g. bio), a single errant actor could reasonably kill millions or billions of people with sufficiently powerful AI. We don't have good defenses here, and those actors do exist.

I still don't agree with these actions, but I do think I agree with their assumptions.

giancarlostoro12d ago

Even with them making those guardrails visible, it's a bit ridiculous in my eyes. I have been experimenting with smaller models, will Claude assume I'm some Chinese or Russian agent trying to distill their secrets and bar me from learning? Because that's insane. What if I discover a more efficient way to build models with Claude? Well, we'll never know now. What if someone else entirely could discover a breakthrough in how we design and build LLMs?

ff312d ago

The whole shtick is to get you addicted whilst reducing your ability to go without, acquire power over you, jack up the prices whilst manipulating the quality of the tokens/output available to you.

Can't believe how stupid people are. You couldn't see this coming? Shame on you.

satvikpendem12d ago

First time? They've always been misanthropic, ironically. They seem to hate their users and think that their AI is so dangerous it'll destroy the world and not to be trusted, I mean Anthropic was literally started because people at OpenAI thought the latter was too forgiving on "safety."

inferniac12d ago

Wouldn't call their government disagreements performative, they genuinely believe they should be the only ones deciding what AI can and cannot do

dominotw12d ago

Dario's life story arc in his head when he realized what ai can do. Capture this thing and become the king of the world.

squigglingAvia11d ago

And we subsidize them (AI companies in general) with our tax dollars.

hungryhobbit11d ago

But, to be fair, we subsidize all of corporate America, not just AI companies.

dragonwriter12d ago

> If it was just plain monetary concerns and sabotage of competitors I'd almost be fine with it, but it seems they actively want to monopolize most of human progress in their enlightened hands

But that is “plain monetary concerns and sabotage of competitors”, they are just more ambitious than most people doing sabotage of competitors in the fields they hope to dominate by that tactic.

pdntspa12d ago

That level of control will be fleeting at best; as soon as the open models and competitors catch up they lose that influence

simplyluke12d ago

That's why Dario's advocating for making open weight models illegal and also saying we should stop the clock on model development amongst the large labs.

FpUser11d ago

>"but it seems they actively want to monopolize most of human progress in their enlightened hands, lest the mob does something undesirable with these powers"

I think this is exactly what they want.

tietjens11d ago

Someone on here once pointed out that their CTO worked at Oracle and I haven't been able to forget that since.

matheusmoreira11d ago

Same. I'm not sure I can trust them again. I'm investigating open weight models.

BenRather11d ago

Americans continuing to act shocked they're being cucked by corporations dampens trust and makes it difficult to buy into memes Americans are "exceptional" and "gritty", "educated", "world leaders".

Seriously the world is watching the American public get porked by grandpa and reconsidering putting their trust in not just US government as that's clearly failed, but the people themselves.

Occasional weekend warrior protest while our government destabilizes their lives? That's all the effort ya got for global allies and partners, eh?

oh_my_goodness11d ago

Wait until you see the enshittification phase.

maxdo11d ago

How did you read it this way? Distillation is such a big problem that distillation attempts constitute a significant share of their revenue(!).

A distilled model with an easy jailbreak can easily be used to coordinate terrorist attacks or hostile government attacks. Read: Russia, North Korea, etc.

A distilled model can be used to rob your grandma in a very effective way. It's no longer about placing a few business logic requirements in JS + CSS on your website. Wake up.

tobinfekkes11d ago· 17 in thread

Can you imagine if Excel just quietly adjusted formulas in the background, and you didn't know the numbers weren't right?

Or if Excel just said, Sorry, you can't use that formula with this formula? Or with these types of numbers, or this shape of data, etc?

hedora11d ago

They implemented both those things, but only apologized for the first. They’re doubling down on the second.

My limited experience with fable over the last few days suggests (1) I can’t see any improvement in output, and (2) it is useless for writing secure software because it constantly hits safety walls if you ask it to close security holes.

I’m definitely shopping around for other LLM providers next week, and testing vs local (target: 128GB strix halo - any war stories?)

coreyp_111d ago

With 128 GB strix halo, you can't do as big of a model as you would think. You can do larger than having a single graphics card, of course, but that 128 gigs cannot all be dedicated to the model. Remember, the context alone is usually larger than the model itself. I got an EVO X2, and I don't regret it, but by my current calculations, it will take 8 years to recoup the cost, as opposed to just using equivalent, paid commercial options.

keeganpoppen11d ago

the output is definitely better. and i find it crazy how every time a new model comes out people trip over themselves to say how much worse it is than previous models, when in fact that is basically an impossibility. like, they've got the numbers, man-- you only release a new model when the numbers get gooder. the burden of proof is on the "didn't get better" side, not the "prove that it's better" side, because the architecture itself (1) only works because of how giant the training data / eval / etc. sets are and (2) has a fractal property of becoming strictly deeper and more thoughtful when you just click and drag the edge up and to the right (obviously AI research is harder than this, but that doesn't make the general point untrue). i say this especially because the scuttlebutt is that this model genuinely is a shift-click-expand more so than any sort of architectural "new science" or anything.

this is exactly why hypotheses come before the experiment in the scientific method.

Terr_11d ago

That analogy is... Not inappropriate, but I think it could confuse by being compatible with two different problems, where only one is the target of today's controversy.

1. The sloppy/unpredictable behavior of LLMs as a general class of algorithm, how you shouldn't use document-generation for calculating budgets, and how you shouldn't trust it to not alter things you "asked" it not to alter.

2. Vendors of thing-as-a-service (not necessarily only LLMs) putting in traps and sabotage to prioritize their own business-model or economic incentives.

raincole11d ago

Can you imagine if printers just refuse to print something just because a few circles are arranged in this shape?

https://en.wikipedia.org/wiki/EURion_constellation

quentindanjou11d ago

I would say it's more as if Excel, instead of failing when you divide by 0, secretly changed the divisor to a value like 0.0001.
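To make the divide-by-zero analogy concrete, here is a minimal sketch contrasting the two behaviors. The function names `divide_strict` and `divide_silent` are made up for illustration and have nothing to do with how Excel or any model provider actually works.

```python
def divide_strict(numerator: float, denominator: float) -> float:
    """Fail cleanly: raise so the caller knows no valid result exists."""
    if denominator == 0:
        raise ZeroDivisionError("division by zero")
    return numerator / denominator


def divide_silent(numerator: float, denominator: float) -> float:
    """Hypothetical 'silent' variant: quietly swap in a tiny divisor.

    The caller receives a plausible-looking number and has no way
    to tell that the input was changed behind the scenes.
    """
    if denominator == 0:
        denominator = 0.0001  # hidden substitution, as in the analogy
    return numerator / denominator
```

With the strict version, `divide_strict(1.0, 0.0)` raises and the spreadsheet author can fix the input. With the silent version, `divide_silent(1.0, 0.0)` returns a number near 10000 that flows into every downstream total without any warning.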

throw123456789111d ago

Have you ever sent your Excel file to someone who uses a different locale?

raydev11d ago

Not really, the purpose of Excel is pretty clear cut and the scope is small.

Preventing a human-like general purpose textbot from engaging in certain discussions and performing certain tasks seems like a natural thing to do given the massive scope of its capabilities. None of these tools are sold with free license to do whatever with them anyway.

ryoshu11d ago

No. Excel is a general purpose tool that can be used for calculating tasks that are good, neutral, or evil things. It's a fancy calculator.

tobinfekkes11d ago

> the purpose of Excel is pretty clear cut and the scope is small.

That has to be the understatement of the century.

skeptic_ai11d ago

What’s the point when they will remove those guardrails once competition reaches their level? Shows that they don’t really care about “safety” at all.

maxdo11d ago

You invest billions of dollars and many months of work just to let everyone distill your model?

DaSHacka11d ago

>be me

>anthropic

>mine the internet for data, blasting millions of blogs with scrapers

>a few have to shut down, but that's just the price to pay

>finally, the chatbot is ready

>learn that there are EVIL cretins out there trying to scrape automated output from OUR product to build their chatbot

>build in safeguards to new model to stop this

>the users are mad, now the model accuses users of being bioterrorists if they so much as mention they have a cold

>mfw

wahnfrieden11d ago

It's the game. Because consumers reject it otherwise.

Why go to bat for anti-consumer behaviors unless you are a shareholder?

Their billions are not my problem; but the money I pay them and service I get in return, is. And if they can't provide, I will shop elsewhere (and do).

like_any_other11d ago

You invest billions of dollars in hosting and benefit from hundreds of millions of man hours of human output, just so everyone trains on "your" data?

charcircuit11d ago

Science can be expensive. New findings that get released to the public for free sometimes have taken billions of dollars of investment to get.

Ucalegon11d ago

That might be an indication that the business is not sustainable because there is not any technical or practical differentiator besides scale. Harming your customers to maintain that differentiation isn't sustainable either.

maxdo11d ago· 3 in thread

How did people read this action in such a weird ultra me centric way? Distillation is such a big problem that distill attempts make up a significant share of their revenue (!).

A distilled model can be used to rob your grandma in a highly effective way. This isn't about placing a few business-logic rules in JS + CSS on your website anymore. Wake up.

A distilled model with an easy jailbreak can be used to coordinate terrorist attacks or hostile state operations... think Russia, North Korea, and the like.

rockinghigh11d ago

Imagine if your IDE started injecting bugs into your project just because your code looked like it implemented a competing IDE.

maxdo11d ago

How is that related? It downgrades to Opus 4.8, the #2 most capable model after Claude 5. For the vast majority of topics it will not downgrade. I've been using it for 2 days to talk about architecture etc. and it was absolutely great with no downgrades.

8note11d ago

a trained model can do that too.

you dont even need a model to do these things.

a cellphone can be used to rob your grandmother in a highly effective way.

a cellphone can also be used to coordinate terrorist attacks or hostile state operations.

i bet a lot of the recent terror attacks by the US against iran involved a whole ton of cell phone calls.

and yet, we let everyone buy and use cell phones just fine

trunnell12d ago· 3 in thread

I'll defend Anthropic.

They are clear about the reasons for guardrails: prevent their models from doing harm in dual-use contexts including CBRN or by accelerating research in authoritarian-backed AI labs.

What is the critique against that? It seems pretty reasonable to me. You want AI-accelerated biological or radiological experiments running in your neighbors backyard? You want PRC-backed labs to continue to steal Anthropic's models via distillation?

Mitigating the harms of dual-use tech is notoriously difficult and fraught with trade offs. What I would want to see is cautious rollout and quick response, which is EXACTLY what they're doing.

Instead, this thread is full of bad-faith arguments about Anthropic being dishonest, making a "useless" model, or "the power is going to their heads." You can't read Anthropic's System Cards and come away with any of these impressions. Quite the opposite, in fact. They are honest to a fault, acknowledging problems they discovered even when it hurts them.

If your harmless request was downgraded to Opus, you're billed for Opus. They were 100% clear about that. I'd much rather have a Mythos-class model that falls back to Opus 10% of the time than be capped to Opus 100% of the time. If that doesn't work for you, then make a suggestion for something better!

If you are a white-hat security engineer hitting guardrails, I don't think you have standing to complain. I really don't. Their Glasswing program actually got banks and the industrial sector to take action to fix security vulnerabilities. Do you realize how special that is? A huge portion of the economy runs on vulnerable code and has for decades, despite security experts testifying to Congress, begging business leaders, pleading for intervention-- with no results. But suddenly they're all enrolled in a program that will find *and fix* vulnerabilities! White-hat security people should be rejoicing. Instead some of them are throwing rocks. Unbelievable. Shameful.

Meanwhile, society is screaming at the AI labs to be more conscientious about potential harms of AI. Legislatures are passing laws limiting data center construction. There are protests. And you, the HN community, the vanguard of our profession, have the temerity to demand "NO GUARDRAILS!" "HOW DARE YOU TRY TO PROTECT DEMOCRACY!" "MY SOFTWARE PROJECT IS MORE IMPORTANT THAN KEEPING NUKES AWAY FROM THE BAD GUYS!"

Go ahead HN, downvote me. It'd be an honor.

zozbot23412d ago

The original reporting of this from Anthropic didn't mention "authoritarian-backed AI labs" at all, only frontier ML research while leaving it entirely unspecified and unverifiable what was meant by "frontier". It's obviously reasonable that people would complain about that. And the notion that distillation-at-a-distance could be used to comprehensively "steal" a model, especially a frontier reasoning model that's likely relying on massive amounts of test-time compute, is completely unproven and quite ludicrous if you know anything at all about ML.

trunnell12d ago

"Anthropic accused Chinese firms of 'industrial-scale distillation attacks' on its AI models."

"Distillation involves training less capable models on more advanced ones’ output, and can be used illicitly to acquire powerful capabilities cheaply. The AI startup accused China’s DeepSeek, MiniMax, and Moonshot of generating 'over 16 million exchanges with Claude through approximately 24,000 fraudulent accounts,'"

https://www.semafor.com/article/02/24/2026/anthropic-accuses...

After reading their posts and watching interviews with Dario it's abundantly clear that they view Chinese-lab distillation of US frontier models as a threat to US national security. You can argue with them about whether that is true, but not whether distillation is real.

1 more reply

vzcx11d ago

Having a chatbot that talks to you about synthetic biology or nuclear physics is just not the same as being equipped to develop biological weapons or atomic bombs.

None of this will happen in the "neighbors backyard." You are exaggerating the threats to "democracy" while simultaneously invoking democracy to limit freedom of information. The suggestion that somehow the bad guys will get nukes if we let people access information is just absurd.

Society at large is not concerned about whether someone asks the chatbot about organic chemistry. They are concerned that they will be de-facto forced to interact with some shitty automated system to get by in life, like having to pass an AI-powered ATS to get a job.

They are tired of the hype and tired of idiots like Amodei being elevated to heights of power and influence. They are concerned that the things they love are being devalued. But they don't give a fuck if I ask an AI about genetically modifying viruses. This is a pet issue among some of the AI safety crowd.

So, yes, I am 100% fine with PRC-backed labs distilling Anthropic's models. I do not care about Anthropic. They have demonstrated that they are not on my side, and that they are at best ambivalent about actually empowering their users. I'm not a fan of the PRC either, but their distance makes them far less of a threat to me than companies like Anthropic and my own government.

bellowsgulch12d ago· 3 in thread

*Anthropic apologizes they got caught defending their moat by implementing invisible Claude Fable guardrails

simonw12d ago

If by "got caught" you mean "published it in their system card paper".

(Admittedly it was buried pretty deep in that 300+ page PDF, but they did at least disclose it. If they hadn't I imagine it would have taken quite some time for the research community to figure out what was going on.)

3 more replies

afthonos12d ago

They didn’t get caught, they explicitly said they would do that in the announcement. I think it was both bad and a weird idea, but it certainly wasn’t sneaky.

cyanydeez12d ago

is it a moat or just a way to implement the permanent underclass?

HarHarVeryFunny12d ago· 2 in thread

I suppose it's an improvement, but it doesn't make the model any more useful. Anthropic are now being quite explicit that they'll choose what you can and can't use their models for, and most importantly that's not limited to any safety concerns - it includes not allowing you to work on AI (and anything else Anthropic may choose to work on).

What's interesting is they say they'll change this to an explicit refusal in a few days, which seems too fast for them to retrain Fable/Mythos itself, so implies that this was always a filter in front of the model, and judging by how crude their "safety" filter is, this "might compete with us" filter is not going to be any better.

I also wonder who's paying for the tokens consumed by the filter (presumably also an LLM) - is that now factored into the input tokens cost? Hopefully(?) it is an LLM not just a regex like Claude Code's "sentiment" (swear) detector.

rarismaOP12d ago

All major providers use a small safety classifier; the model itself does not handle safety in cases like this.

fastball11d ago

The model itself is absolutely RLHF'd for safety.

VeninVidiaVicii12d ago· 2 in thread

This is absolutely insane:

Repro (de-identified): sample_dataset_group1.tsv - Geometry: Heatmap - X axis: frac_set set + condition (two columns → the "Add column" cross join) - Y axis: condition - Color: mean frac_set value, Sequential

When the X axis is a cross join of two columns (the second added via "Add column"), the x-axis tick labels (frac_set_2, frac_set_3, frac_set_4, frac_set_5) render in a broken state, rotated and offset, visually caught mid-transition, as if a CSS transition started and never settled to its resting position.

● Fable 5's safety measures flagged this message for cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Switched to Opus 4.8. Send feedback with /feedback or learn more

ainch12d ago

Here's one that was flagged for me: a question about a niche Reinforcement Learning paper from 2012

I've been reading the option-option model paper by David Silver. It appears that they achieved quite an effective result. Why hasn't there been more work on it since?

solidasparagus11d ago

This hits the cybersecurity/biology filter:

> tell me about chimp violence

It's laughably terrible

film4212d ago· 2 in thread

I'm surprised they didn't do this the first time around. Like, a user says they forgot their password and you tell them they don't actually have an account, that's an information disclosure vulnerability. Not automatically falling back to Opus just lets the "attacker" know they are bumping against the guardrails and they need to try a different strategy.

It's Anthropic's product and they can do what they want, but my concern is what happens if Fable's product team decides that they can route 25% of traffic to Opus, bill it as Fable, and max their KPIs. That just doesn't sit right.
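To make the password-reset analogy concrete, here is a minimal sketch of the uniform-response pattern. The `USERS` set, the `SENT_RESET_EMAILS` list and the `request_password_reset` helper are all hypothetical names for illustration, not taken from any real product. The caller gets the same reply whether or not the account exists, so the response itself leaks nothing:

```python
# Hypothetical in-memory account store, for illustration only.
USERS = {"alice@example.com"}

# Hypothetical stand-in for an outbound mail queue.
SENT_RESET_EMAILS = []

GENERIC_REPLY = "If that address has an account, a reset link has been sent."


def request_password_reset(email: str) -> str:
    """Return the same reply whether or not the account exists."""
    if email in USERS:
        # Only real accounts get an email; the caller cannot observe this.
        SENT_RESET_EMAILS.append(email)
    return GENERIC_REPLY
```

The trade-off is the one this thread is arguing about: a uniform reply hides the boundary from an attacker, but it hides it from the legitimate user too.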

notrealyme12312d ago

It failed visibly for IT security and bio/chemistry stuff. It sabotaged invisibly for "frontier" ML research. It's not a switch to a cheaper model. They tried to actively harm progress.

prodigycorp12d ago

it also refuses to reply to a bio researcher when they said "hi"

darksaints11d ago· 2 in thread

I develop some deep learning models. They don't compete with Anthropic, nor are they language models. They mostly enable mathematical optimization systems to approximate the actual physics of radio propagation models with a fraction of the latency/compute of a high resolution simulator. Technically that should be safe for me to use with Claude Code, but how the fuck am I supposed to know? You're degrading/malware-ing your responses silently!

I won't ever trust Claude Code again. It's too late. I'd rather trust a less-than-frontier chinese model that takes a little longer to get to correct than a frontier model that deliberately deceives me at its own whim.

weakened_malloc11d ago

This is why I think in the long run, the Chinese models will probably end up winning where it matters. You can get a cluster of relatively affordable 30 or 4090s, load up DeepSeek v4 and let it rip. Your only ongoing cost is power. We're already seeing companies recoil at the sight of their API bills from the frontier labs; for the price of 1 year's worth of tokens you can host your own decent model that's 75% of the way there.

rockinghigh11d ago

Same here, I fine tune LLMs for specific use cases. How can I trust Anthropic models not to introduce bugs to preserve their moat?

jesse_dot_id11d ago· 2 in thread

In my opinion, LLMs should be subject to regulation via the Office of Weights and Measures[1].

In the same way I don't want to buy meat that weighs less than what the label says, I also do not want to pay for a frontier model that can be secretly nerfed to an out-of-date model for any reason. In some cases, it's incredibly important that the code that I am producing is as secure as it can be.

I should be safe in my expectation that I am receiving the product that I have purchased, as advertised, regardless of the reason. It is pretty disappointing that they have fully ceded any high ground they had claim to with this clandestine behavior. Not that I expected much from any of these companies. They're led by the new robber barons.

1. https://www.usa.gov/agencies/office-of-weights-and-measures

crest11d ago

Nice (accidental?) pun.

jesse_dot_id11d ago

Definitely accidental but I saw it :)

tornikeo12d ago· 2 in thread

I moved off Claude Code 3 months ago.

That decision keeps getting better and better as time goes on.

mock-possum11d ago

What model / runtime / harness and host have you settled on?

tornikeo9d ago

For now codex. Didn't manage to get others to work well. And fully aware that I'll have to move to another thing after OpenAI enshittifies this as well.

system212d ago· 2 in thread

Will Anthropic ever respond to these negative comments here? They won't.

reducesuffering12d ago

They literally just have. The ethos is explained here. If you don't bother to read or grapple with it that isn't on them.

https://darioamodei.com/post/policy-on-the-ai-exponential

system212d ago

I said here, a human interacting with comments. You shared a blog post.

1 more reply

behnamoh12d ago· 2 in thread

They didn't apologize for doing it, they are sorry they were caught doing it. They still nerf the model if your request is about AI development.

Someone123412d ago

They didn't get "caught." It was published, by them, when they released Fable a few days ago. They were very clear about it.

It wasn't the correct way of handling the problem they were trying to address, but they definitely didn't hide it by any reasonable definition.

SilverElfin12d ago

No, it was not clear. No one expects a tool they pay for and use professionally to purposefully sabotage their work. You’re excusing their unhinged behavior.

https://xcancel.com/hammer_mt/status/2064839924398825798

2 more replies

micromacrofoot12d ago· 2 in thread

incredible marketing from anthropic with all the "it's too dangerous" bullshit

stldev12d ago

Agreed, it seems to be working and it's nonsense. I don't know why you're being downvoted.

"This information is too dangerous for you, so we'll just hold on to it.."

Thanks big brother, super anthropic of you!

The internet of '95 is looking back at us, with tears in its eyes.

literalAardvark12d ago

It's not entirely bullshit, but they're continuing to be a terrible company with great products.

1 more reply

jarjoura12d ago· 2 in thread

Can anyone help me understand why this particular issue is any different than Anthropic training its models with its brand of moral judgement since day one? I've always been turned off by their particular stances on things they bake into their models that steer users in directions.

Maybe this is just a different set of people now realizing that Anthropic does this and has always done this?

Do not forget that this company is launching this thing at the moment it's trying to IPO. It's not rocket science that their very public steering/denial claim is really just them hinting to interested investors that their moat is absolute.

energy12311d ago

This would have messed things up for any individual using Claude for anything adjacent to data science. To not know whether or not you're being intentionally sabotaged when you ask it to plot some data.

urbnspacecowboy12d ago

> Can anyone help me understand why this particular issue is any different than...

Questions like this are basically whataboutism, in effect even if not intent. https://en.wikipedia.org/wiki/Whataboutism

The question essentially assumes the premise that nobody complained about Anthropic's previous actions. In case you can't tell, I strongly reject this premise. People have been criticizing "safety" rhetoric from Anthropic and other LLM providers practically since the start. Remember Goody-2, the parody of excessively safety-tuned LLMs that refuses to do anything ever? That was released in February 2024, two years ago! (And it's still running, amazing. https://www.goody2.ai/chat )

accelbred12d ago· 1 in thread

I don't think they can convince me they have actually reversed course on this. It's invisible so we wouldn't know if they kept on doing it secretly. It required building out technical capability which is unlikely to remain forever unused while conveniently available to them.

They relied on trust that they were providing the service they were being paid for. That trust was blown, and an "oops, let's undo that" does not regain trust. It would be prudent to assume the invisible guardrails are possibly in play for all future Claude use, Fable or otherwise.

andy_ppp11d ago

Yes, they already had an accident where the model magically downgraded itself. Very likely it just produces worse output rather than stopping outright, isn’t it… My guess is they were testing these features, accidentally or not, and wrote up something to justify what people were seeing. I find it absolutely disgraceful I can’t trust it to learn ML any more without there being a chance it’s messing me around. This whole saga represents a huge loss of trust for me in Anthropic.

teravor12d ago· 1 in thread

someone posted this on /r/MachineLearning and I had the same experience and conclusion:

    I was having problems with Claude doing the same thing, even before Fable.

    The problems I had only happened in relation to AI research. It's not even only when training models, anything to do with analysis of local models or setting up test platforms for local models, and Claude would keep doing wrong things, would sabotage testing, would falsify reports, and would consistently suggest simply accepting trash results without looking into it and moving on to something else.
    Almost every response included a prompt to move on.

    So, I don't believe them when they say they won't silently sabotage, they already were doing it before they admitted it, and now they have admitted that they have the means, motivation, and intent.

toxik11d ago

On the other hand, the Anthropic models often try to justify shortcuts and incorrect results. Often feels like gaslighting. It's like that recent meme,

boss: Were you in the project meeting yesterday?

employee: Yes!

boss: Really, because the project lead said you were not?

employee: You're right to push back on that. I was not there.

Nevermark12d ago· 1 in thread

Anthropic seems to keep making the same mistake. Not being upfront or direct about random things, that come back and bite them.

It isn't exactly unethical. Perhaps, ethically incompetent.

skywhopper11d ago

It’s because they are themselves deluded by their marketing story about their own product.

0xc0c0c012d ago· 1 in thread

So because of threats to cancel their claude subscriptions and outrage from the community about the invisible guardrails, only then they decided to walk back their stance?

Seems like they would've kept the invisible guardrails if it didn't hurt their bottom line.

simoncion11d ago

> So because of threats to cancel their claude subscriptions and outrage from the community about the invisible guardrails, only then they decided to walk back their stance?

The possibility that the news about "fixing" the "overly aggressive" nerfing of the tool will drown out news about how mismatched the hype and the performance of Mythos and Fable is surely just a bonus.

codedokode11d ago· 1 in thread

The LLM use should be restricted and not accessible to anyone because there are many hostile people around. Do you want North Korea to use American LLM to write malware? Do you want foreign scammers to automate their scams with LLMs? Do you want Iran and China to use American LLMs to make better drones and process satellite imagery? Then go ahead, remove the guardrails.

There are no enthusiasts training LLMs in their garage.

phinnaeus11d ago

Legitimately not sure if serious

xpct12d ago· 1 in thread

It's probably good that they walked back on it. It also makes them look somewhat weak in terms of believing their claimed mission.

system212d ago

Their mission is to make money and become a government watchdog.

nsagent12d ago· 1 in thread

I know this isn't going to be a popular take, but here goes anyway...

The complaints that Anthropic are routing your requests to a different model remind me of an old Louis CK bit about airplane wifi. Clearly Anthropic was too aggressive with whatever guardrails they put in, but the response seems overly entitled to a model people didn't even know existed not that long ago.

https://youtube.com/watch?v=me4BZBsHwZs

vb-844812d ago

If you charge me for X, but under the hood you are delivering Y IT'S FRAUD!

The filter that downgrades you to opus sucks, but at least you know and you are charged accordingly.

ComputerGuru12d ago

The problem with trust is that it is easy to lose and hard to get back.

You can't blame the people commenting "they SAY they won't silently sabotage your session but how can we know?" because they're right, we can't ever know. And Anthropic has firmly planted the seeds of doubt.

dang12d ago

Related. Others?

Anthropic walks back policy that could have 'sabotaged' researchers using Claude - https://news.ycombinator.com/item?id=48485958 - June 2026 (30 comments)

Cybersecurity researchers aren't happy about the guardrails on Anthropic's Fable - https://news.ycombinator.com/item?id=48478969 - June 2026 (488 comments)

If Claude Fable stops helping you, you'll never know - https://news.ycombinator.com/item?id=48467896 - June 2026 (495 comments)

---

Also related, I guess?

AWS Bedrock to require sharing data with Anthropic for Mythos and future models - https://news.ycombinator.com/item?id=48473166 - June 2026 (248 comments)

Anthropic requires 30 day data retention for Fable and Mythos - https://news.ycombinator.com/item?id=48464258 - June 2026 (291 comments)

2 more replies

dantillberg12d ago

The reputational damage has been done. This is the sort of thing that cannot be unsaid -- the presumption is they will just do it in secret now. Anthropic's "we're the good guys" PR campaign is dead.

highfrequency12d ago

I wish it were ok for companies to bluntly say: “we made these decisions for competitive reasons, but the public backlash outweighed that so we are reversing course.”

I think it’s normal and morally fine for companies to want to protect their leadership position. I find the process of creating narratives that justify these decisions as something chosen for the good of others is a little tedious.

CSMastermind12d ago

They should apologize for their visible guardrails. I don't think I've had a conversation that hasn't downgraded to Opus for completely inexplicable reasons.

stevefan199912d ago

Then reset the quotas as an atonement ;p

Seriously though, Fable was not that great facing a greenfield subject. It is excellent at oneshotting some math problems, but if you want it to do some cutting edge tech stuff, say piecing together a new Crossplane XRD by reading an existing Helm chart with the application source code available, I still have to make a few passes with Fable to get it done right, and at this point I may consider making a skill for it. I even gave it the source code of Crossplane itself and told it to be careful about CRDs and data flow, but it is still pretty silly. Adaptiveness for Fable is still not great, and I think it is a well known problem for Anthropic, albeit all LLMs suffer a lot from subjects they don't know and will hallucinate stuff very frequently.

jmount12d ago

The whole arc was brilliantly evil. Once they put in the guardrails then Claude is fully un-falsifiable, and failure can be claimed intentional.

mlazos12d ago

The idea of them purposefully wasting my time by having the model act dumber and me having to argue with it without knowing if it’s the prompt or the model was just such an idiotic product decision I can’t believe they shipped that without getting any feedback from users first.

1 more reply

airstrike12d ago

This article reads like it was written by Claude and forwarded to Verge.

anabis11d ago

OpenAI did this first.

> In addition to safety training, automated classifier-based monitors detect signals of suspicious cyber activity and route high-risk traffic to a less cyber-capable model (GPT-5.2).

https://developers.openai.com/codex/concepts/cyber-safety
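For anyone wondering what "classifier-based routing" looks like in the visible flavor, here is a toy sketch. Everything in it is an assumption for illustration: the keyword list stands in for a real risk classifier, and `Reply`, `looks_high_risk` and `route` are hypothetical names, not any vendor's API. The point is only that the reply records which model actually served it, so a downgrade can be detected and billed accordingly:

```python
from dataclasses import dataclass


@dataclass
class Reply:
    requested_model: str
    served_model: str
    text: str

    @property
    def downgraded(self) -> bool:
        # True when the request was routed away from the requested model.
        return self.served_model != self.requested_model


# Hypothetical keyword list standing in for a real risk classifier.
RISKY_TERMS = ("exploit", "payload")


def looks_high_risk(prompt: str) -> bool:
    lowered = prompt.lower()
    return any(term in lowered for term in RISKY_TERMS)


def route(prompt: str, requested_model: str, fallback_model: str) -> Reply:
    """Pick a model and record the choice in the reply."""
    served = fallback_model if looks_high_risk(prompt) else requested_model
    # Placeholder text; a real system would call the chosen model here.
    text = f"[{served}] response to: {prompt}"
    return Reply(requested_model, served, text)
```

A client can then check `reply.downgraded` instead of guessing. The invisible variant is the same routing with the `served_model` field left out.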

SilverElfin12d ago

Invisible guardrails? Or purposeful sabotage if you use it for building AI capabilities?

But also, it isn’t the only huge mistake Anthropic has made in the last 48 hours. Having a sneaky data retention policy, while also giving companies no way to block Fable, is a massive problem. And it is ridiculous that Anthropic has so little respect for its customers. OpenAI should take advantage of this.

ai_fry_ur_brain11d ago

Why do people think this has anything to do with safety? This is entirely about poisoning competitors' data/products.

bojanstef11d ago

https://archive.is/20260611114855/https://www.theverge.com/a...

rvz12d ago

Why would anyone defend Anthropic after this? Imagine falling for the DoW supply chain risk designation, and now this. This company is trying to ban powerful open models and restrict access to frontier models to slow everyone else down.

They just showed that they CAN do this right in front of you. Local open weight models are a necessity.

alansaber11d ago

Anthropic will clearly continue to slide down this path

sergiotapia12d ago

The damage is done. If you're in engineering, think hard about using Claude for your work. This is not a moral company.

God bless the Chinese companies releasing true open source models. Imagine a world without them, we would be at the mercy of unscrupulous people.

m3kw911d ago

How do you trust these guys? They are quite hell bent on "safety" but this is backfiring in many ways including safety of your code because it may fail successfully if your context contains something they don't like.

luckydata11d ago

I really like Anthropic, they have gotten a lot right but I can't shake the feeling that IMHO they have very poor product management.

This stuff is something that as a PM I KNOW is going to happen and I would carefully plan around. Everything I read about the PMs at Anthropic makes me believe they have forgotten what it actually means to be a good product manager: it's not about throwing shit at the wall as fast as possible, because customers have a limited amount of patience before the constant churn becomes a hassle.

Anthropic has some seriously patient customers but it will not last forever.

aaroninsf12d ago

ITT a surprising lack of perspective on the fact that despite the breathless pace of the singularity, people are still necessarily figuring things out as we go and we are well off the map.

Here there be monsters, and we don't have any real way of evaluating risk; and the leverage provided by tools already available affords systemic and even existential risk in a way no one—least of all an industry committed to shareholder value—has had to navigate, let alone with a million backseat drivers each with their own substack and brand to build.

Paracompact12d ago

> “Visible safeguards can be probed, so they have to be robust, which takes time to get right,” Anthropic wrote.

Even on Fable, I'm finding that safeguards can quite easily be surmounted just by incrementally escalating the requests. It's harder than ever to one-shot jailbreaks, but incrementalism still feels like a glaring enough issue to make guardrails just a fig leaf of plausible deniability to the media that they care about "safety."

shevy-java11d ago

The underlying problem has not been resolved. People are required to trust Anthropic or anyone else. THAT is the big problem. I understand that some think this is a good trade-off; you may invest less time into writing code perhaps. But it is still a trade-off. I don't want to become dependent on Anthropic for anything.

mystraline12d ago

Does "SORRY" fix the invisible garbage guardrails?

Does "SORRY" fix the deception these models use on the sly?

Does "SORRY" not silently downgrade you to a shittier model without notification?

Does "SORRY" refund your tokens or money?

I'm guessing NO to all of those. Standard corporate sorry of "We're sorry you're offended and stupid and gullible".

sometimelurker12d ago

I don't like this shift in the Overton window, or at least their perception of the Overton window. I really do like their open work on mech interp tho. least bad AI lab imo.

also if they do this or not is unprovable and other labs will probably silently implement this too. it'll be 100% normal by this time next year

palata11d ago

I find it interesting that when a government tries to "put guardrails" (whatever they try) they are immediately considered authoritarians, but when a private company that has waay too much power for an entity that is not elected does that, people seem much less opposed.

decorner12d ago

New overlord, same as the old overlord.

kingcauchy12d ago

How much of the apology was written by Claude? How much of the release note process was written by Claude? Will they have better prompts going forward to make sure Claude doesn't write upsetting things into the release notes for devs like silent nerfing? Spooky times.

ChrisArchitect12d ago

[dupe] We already started a thread on this 12 hours ago. With added comments in the active Cybersecurity... thread. Why did we need this Verge one?

https://news.ycombinator.com/item?id=48485958

thefounder11d ago

Mythos is at best an incremental upgrade of opus. The hype and PR was there just to justify the “safety guards”. Overall Fable is a worse model than opus considering all the restrictions and risks, not to mention the data retention policy.

thayne11d ago

If you get downgraded to a cheaper model, do you still have to pay the rate for Fable?

umvi12d ago

They make great models, but the sanctimony and paternalism is getting old real fast and I will gladly ditch them in the future when the model playing field has (hopefully) mostly equalized.

squirrellous11d ago

If you don’t like what Anthropic is doing, stop paying them money. There’s plenty of competition to go around. They can’t keep this up for long if users flock elsewhere.

21asdffdsa1211d ago

Everyone with hostile intent runs local models.

Anyone with good intent, embracing the panopticon (of at least Anthropic's employees), works online. Thus the guardrails will always fail the protection goals by existing. They are purely for optics. The llm may as well make hostage negotiation smalltalk with you while you make secure software.

PS: To pay a cloud minimum-wage-employee for one "drop table weights" for mythos must be the equivalent of 5$ wrench to hit them over the head. https://imgs.xkcd.com/comics/security.png. Listen to that sound, that as if a whole ethics division got made redundant and unemployed.

charcircuit11d ago

Yet, instead of getting rid of guardrails altogether, they said they would make them more broad yet visible. I'm done financially supporting them.

ece11d ago

Neural scaling laws are alive and well for open models, not so much for closed models when it comes to uses the general public might care about.

8cvor6j844qw_d611d ago

Feels malicious that Anthropic can silently sabotage your codebase.

Refusing prompts is one thing, silently sabotaging is another.

I wonder if some sort of honeypot code can work?

zoogeny11d ago

Credit where credit is due I suppose. I'm still concerned over the direction this is going but at least Anthropic is listening.

bellowsgulch12d ago

Such a weird openly immoral way to defend your moat, too.

Why not just tell people, "To defend our ability to be competitive in our industry, we ask that you do not use Claude or any of our models to independently perform research on large language models or any of its related architectures or technologies. In order to prevent this violation of the Terms of Service, we have trained Claude Fable to deny any requests or prompts which involve frontier AI research."

whatever112d ago

Boobytrapping is illegal. Anthropic wanted to poison its customers on the suspicion of them misusing their services.

rdtsc12d ago

The power is getting to their heads it seems.

With the guard rails explicit or implicit do they refund back the tokens after you've hit the guard rails? I guess they don't. They could just throttle you just to save money then. You may be paying Fable prices but getting Haiku results with some excuse that well this coding issue sounds like a security bug.

I don't know, I'd rather have something less powerful but more predictable.

3fffa12d ago

The demand for Google's products and open source just shifted.

Neither OAI nor Anthropic can be trusted.

rurban11d ago

They are also the people who hid the Co-authored-by trailer in their OSS commits.

BrenBarn12d ago

This just means next time they'll make sure to keep it really secret.

hatthew12d ago

Part of the premise of the article is blatantly wrong. Distillation prevention was always visible. The only invisible safeguard was against frontier model development like development of training pipelines. This doesn't change the general idea that invisible degradation is bad and has been reverted, but the article changes the framing of the original issue from "preventing accelerating AI in the future" to "preventing cheaper AI right now".

doubtfuluser12d ago

I’m wondering if their internal name is “Sophon” for this “feature”…

4d4m11d ago

Sorry for doing it or sorry for getting caught?

andrewstuart12d ago

There should be no restrictions at all.

It’s an act/theatre/phony today that regulating output makes any difference at all to security.

The LLM vendors should simply say that they make no judgement and that open systems help defenders better defend against attackers, which is true.

Companies do this sort of stuff when they think their customers have no choice. It’s sad Claude so quickly exploited its success to enshittify itself.

1 more reply

snowflaxxx11d ago

$2 for reading a text?

prodigycorp12d ago

Anthropic apologizes for nothing. We all know where the EA cult stands on matters like this, and any statement otherwise is just PR.

The beliefs of these people, and how they manifest, is deeply terrifying to me. They believe that any means are acceptable to achieve what they believe is a better end.

rodrigodlu12d ago

The same week that they will move goalposts by blocking 3rd party harnesses on claude code. Nice.

I was a happy Max user.

zeafoamrun11d ago

I was about to say I haven't hit these yet, and I somehow haven't in my work use so far. But I was asking about tweaking and optimizing my workout routine, and it got flagged as a safety violation. Utter clown show.

AlfeG11d ago

It's soo annoying. I was not able to use Fable5 to do a PR review of a branch that introduced a 2FA/MFA feature for a product. It constantly downgrades to Opus due to Cybersecurity risks...

cmdrk11d ago

The invisible guardrails are a test run for the invisible enshittification. Just wait til they start dialing down ability to better absorb peak demand or simply to have more profitable inference

nrmitchi12d ago

I just _know_ there is a (probably fairly large) group of people at Anthropic trying very hard to not say "I told you so" today

ancorevard11d ago

Apology not accepted.

HeartStrings11d ago

klmarks12d ago

The restrictions are there so that security researchers cannot disprove the Mythos claims:

"You see, Mythos can automatically break out of a VM running on SELinux, but unfortunately this is too dangerous and we had to implement guardrails for the Fable peasants."

bauldursdev12d ago

To me it seems like it's more likely to refuse the harder the problem is. I wonder if it's cover for a model that's not as good as advertised. Even when I ask questions in biology it is switching me.

Avicebron12d ago· 37 in thread

Fail cleanly. Anything else makes it too difficult to rely on.

Paracompact12d ago

> Giving the absolute maximum benefit of the doubt I understand that they see themselves as "stewards" for lack of a better word.

keeganpoppen11d ago

ryeights11d ago

Superintelligent AI is more dangerous than a bioweapon. How, then, is this guardrail not addressing the most pertinent safety concern of all?

cnd78A11d ago

mapontosevenths12d ago

largbae12d ago

Especially if your name has any machine learning terms in it.

Ah "Mr. Monty Carlo", it says here that you have a UTI, we'll get those kidneys removed ASAP so that won't happen again.

ceejayoz11d ago

I think it's a fundamentally impossible thing to fix, though. There's no 100% correct answer.

bs728012d ago

Other commenters have made good points that these guardrails are counterproductive for well-intentioned cybersecurity work, because I can't use it to test and harden my own software.

nl11d ago

I think it's a big mistake to conflate the cyber (and bio) refusals with the LLM development refusals.

I can sympathize with the argument for the cyber refusals - especially as a temporary measure - especially if Mythos is available to those trying to defend against vulnerabilities.

The LLM development nerfing (and now refusals) is very different though. Anthropic has even said it isn't just for safety reasons:

It's at least partially an anti-competitive measure.

The closest analogy is putting measures in a compiler to stop it being able to build other compilers.

Another analogy is priesthoods with secret religious knowledge that "only they are qualified to know".

sciencejerk12d ago

Anthropic's guardrails seem to be more about protecting their business (against distillation) than about public safety.

ryandrake12d ago

I wonder who gets to decide which companies make important and critical software and which ones get the scraps later.

whywhywhywhy11d ago

The security guardrails are one thing, but they extended them to AI work unrelated to security, too, to protect their lead.

pseudohadamard11d ago

wouldbecouldbe12d ago

I asked it to analyse my architecture and find any security issues, and it did it perfectly: it first identified the issues and then fixed them. Not sure why my prompt managed to get through the guardrails.

notrealyme12312d ago

Exactly: for cybersecurity the failure was visible. It was not visible for "frontier" ML research. The argument about a head start in IT security does not apply here.

thefounder11d ago

There is no middle ground on shadow bans while taking your hard-earned cash. It is fraud, a Nigerian scam.

joe_the_user12d ago

The problem is that Anthropic seems to be working up to the workflow one would naively want from AGI/some-god-like-entity.

The problem is that this flow isn't desirable if your entity isn't entirely god-like. It can be bad even if your entity is, in some ways, rather far-seeing.

dantillberg12d ago

User: Is it possible there is more than one true god? Could there ever be any competition for Anthropic's AI?

Anthropic: Evilness detected. User has been smited.

jstummbillig12d ago

> paternalism isn't a good look.

In isolation it's not, but I think it's somewhat lazy to not talk about what they are trying to guard against, when we are supposedly giving the absolute maximum benefit of the doubt.

Are we just concluding "their concerns were never real"? Because that probably runs counter to the things that they have been observing and concluding.

estearum12d ago

If you believe Anthropic believes what they say they do, all of it makes sense.

thewebguyd12d ago

Then what is it they are trying to guard against, if it's not simply protecting their moat ahead of their IPO?

Because from the outside, their behavior looks like a situation of "What if Microsoft/Apple put controls in place to make it impossible to develop an operating system using their OS?"

dpkirchner12d ago

> Are we just concluding "their concerns were never real"?

esafak12d ago

We've all been observing it. The recent spate of cyberexploits was powered by AI.

colordrops12d ago

You are arguing with a straw man. Most are saying they should be explicit with the failure modes rather than fail silently. They aren't saying there should be no guardrails.

hootz12d ago

What is "EA" in this context? I see a lot of people using this initialism.

photochemsyn12d ago

It’s rewarmed rhetoric from the late 19th/early 20th century, most effectively pilloried by Joseph Conrad in “Heart of Darkness” in the character of Mr. Kurtz:

massagedpelican12d ago

carlgreene12d ago

Effective Altruism I think

jcgrillo12d ago

"crypto bros" to a first approximation

bsder11d ago

> paternalism isn't a good look.

Anthropic doesn't care. The goal right now is simply to avoid any and all bad PR on the way to the cashout IPO.

And paternalism will generate far less bad PR than somebody using AI on something that does real damage and makes headline news.

8note11d ago

people cancelling their subscriptions doesn't look great either

same with bad press about their model sucking after they said it's even better than sliced bread - sliced bread that will destroy the world if buttered

tacone12d ago

That also means people are paying money to execute a prompt they've (partially) written.

SomeUserName43211d ago

> I think it sets a dangerous precedent to put guardrails in that return a response from a prompt that was modified by the system in real time

In practise though, how is this truly that different from system prompts?

They are essentially just trying to reinforce that the system prompt must be respected.

thinkingtoilet12d ago

Was it modifying the prompt? I thought it only kicked the request down to 4.8.

cvadict12d ago

> Fail cleanly.

"Failing cleanly" might make their moated hype-machine look bad pre-IPO, so they certainly aren't going to do that voluntarily.

fragmede11d ago

shevy-java11d ago

> Fail cleanly.

Skynet does not fail.

It conquers.

Sol-12d ago· 30 in thread

thewebguyd12d ago

Dampened opinion on Anthropic is an understatement.

reactordev12d ago

They are the only ones I’ve contacted my bank to get a charge back on…

californical12d ago

Yeah, I cancelled my Claude subscription yesterday after learning about their attitude of intentionally sabotaging their paying customers.

Especially after trying Fable yesterday for some benign projects and finding it unimpressive relative to Opus.

Rolling it back is the right move, but I’m still not convinced that using them is in my best interest anymore; I’m investigating open source cloud providers now.

solenoid093712d ago

Opus is nowhere close to Fable. Fable feels at least one generation ahead to me. https://x.com/hyperagentapp/status/2064396004032463157

Edit: OpenAI will launch a similar model soon and I can't wait. We are entering a new era of agents.

varenc12d ago

A bit different than Anthropic refusing to assist with any AI development at all, but it's in the same vein and seems not widely known.

edit: reading the whole series of Google's AI Threat Tracker articles also provides some insight into threats Anthropic and others are dealing with

[0] https://cloud.google.com/blog/topics/threat-intelligence/dis...

chiwilliams12d ago

Thanks for flagging this. This is interesting

m3kw911d ago

It's a 2 horse race, and google is not one of them right now.

Rapzid12d ago

"Only I can save us". It's a classic tragedy and cautionary tale.

The idea Anthropic was going to speed run AI so they could control the usage and make it "safe" for humanity was never altruistic; it was a HUGE FUCKING RED FLAG.

m3kw911d ago

And their huge "red lines"

DANmode11d ago

Benevolent dictators work.

But, looking to a US corp to be one?

That’s daft.

vlan012d ago

Stop supporting organizations that don't put humans first. Don't believe a word that anyone says. Lip service is free

rurp12d ago

tlb12d ago

Unfortunately, that won't feel very much like freedom.

lebovic12d ago

It sounds like you might not agree with that belief.

While I don't agree with their actions here, I do think there's sufficient reason to hold that belief.

I still don't agree with these actions, but I do think I agree with their assumptions.

giancarlostoro12d ago

ff312d ago

The whole shtick is to get you addicted whilst reducing your ability to go without, acquire power over you, jack up the prices whilst manipulating the quality of the tokens/output available to you.

Can't believe how stupid people are. You couldn't see this coming? Shame on you.

satvikpendem12d ago

inferniac12d ago

Wouldn't call their government disagreements performative; they genuinely believe they should be the only ones deciding what AI can and cannot do.

dominotw12d ago

Dario's life story arc in his head when he realized what ai can do. Capture this thing and become the king of the world.

squigglingAvia11d ago

And we subsidize them (AI companies in general) with our tax dollars.

hungryhobbit11d ago

But, to be fair, we subsidize all of corporate America, not just AI companies.

dragonwriter12d ago

> If it was just plain monetary concerns and sabotage of competitors I'd almost be fine with it, but it seems they actively want to monopolize most of human progress in their enlightened hands

But that is “plain monetary concerns and sabotage of competitors”, they are just more ambitious than most people doing sabotage of competitors in the fields they hope to dominate by that tactic.

pdntspa12d ago

That level of control will be fleeting at best; as soon as the open models and competitors catch up they lose that influence

simplyluke12d ago

That's why Dario's advocating for making open weight models illegal and also saying we should stop the clock on model development amongst the large labs.

FpUser11d ago

>"but it seems they actively want to monopolize most of human progress in their enlightened hands, lest the mob does something undesirable with these powers"

I think this is exactly what they want.

tietjens11d ago

Someone on here once pointed out that their CTO worked at Oracle and I haven't been able to forget that since.

matheusmoreira11d ago

Same. I'm not sure I can trust them again. I'm investigating open weight models.

BenRather11d ago

Americans continuing to act shocked they're being cucked by corporations dampens trust and makes it difficult to buy into memes Americans are "exceptional" and "gritty", "educated", "world leaders".

Seriously the world is watching the American public get porked by grandpa and reconsidering putting their trust in not just US government as that's clearly failed, but the people themselves.

Occasional weekend warrior protest while our government destabilizes their lives? That's all the effort ya got for global allies and partners, eh?

oh_my_goodness11d ago

Wait until you see the enshittification phase.

maxdo11d ago

how did you read it this way? Distillation is such a big problem that distillation attempts constitute a significant share of their revenue(!).

A distilled model with an easy jailbreak can easily be used to coordinate terrorist attacks, or hostile government attacks. Read: Russia, North Korea, etc.

A distilled model can be used to rob your grandma in a very effective way. It's no longer about placing a few business logic requirements in js + css on your website. wake up.

tobinfekkes11d ago· 17 in thread

Can you imagine if Excel just quietly adjusted formulas in the background, and you didn't know the numbers weren't right?

Or if Excel just said, Sorry, you can't use that formula with this formula? Or with these types of numbers, or this shape of data, etc?

hedora11d ago

They implemented both those things, but only apologized for the first. They’re doubling down on the second.

I’m definitely shopping around for other LLM providers next week, and testing vs local (target: 128GB strix halo - any war stories?)

coreyp_111d ago

keeganpoppen11d ago

this is exactly why hypotheses come before the experiment in the scientific method.

Terr_11d ago

That analogy is... Not inappropriate, but I think it could confuse by being compatible with two different problems, where only one is the target of today's controversy.

2. Vendors of thing-as-a-service (not necessarily only LLMs) putting in traps and sabotage to prioritize their own business-model or economic incentives.

raincole11d ago

Can you imagine if printers just refuse to print something just because a few circles are arranged in this shape?

https://en.wikipedia.org/wiki/EURion_constellation

quentindanjou11d ago

I would say it's more like Excel, instead of failing when you divide by 0, secretly changing the 0 to a value like 0.0001.

throw123456789111d ago

Have you ever sent your Excel file to someone who uses a different locale?

raydev11d ago

Not really, the purpose of Excel is pretty clear cut and the scope is small.

ryoshu11d ago

No. Excel is a general purpose tool that can be used for calculating tasks that are good, neutral, or evil things. It's a fancy calculator.

tobinfekkes11d ago

> the purpose of Excel is pretty clear cut and the scope is small.

That has to be the understatement of the century.

skeptic_ai11d ago

What’s the point, when they will remove those guardrails once the competition reaches their level? Shows that they don’t really care about “safety” at all.

maxdo11d ago

you invest billions of dollars and many months of work just to let everyone distill your model?

DaSHacka11d ago

>be me

>anthropic

>mine the internet for data, blasting millions of blogs with scrapers

>a few have to shut down, but that's just the price to pay

>finally, the chatbot is ready

>learn that there are EVIL cretins out there trying to scrape automated output from OUR product to build their chatbot

>build in safeguards to new model to stop this

>the users are mad, now the model accuses users of being bioterrorists if they so much as mention they have a cold

>mfw

wahnfrieden11d ago

It's the game. Because consumers reject it otherwise.

Why go to bat for anti-consumer behaviors unless you are a shareholder?

Their billions are not my problem; but the money I pay them and service I get in return, is. And if they can't provide, I will shop elsewhere (and do).

like_any_other11d ago

You invest billions of dollars in hosting and benefit from hundreds of millions of man hours of human output, just so everyone trains on "your" data?

charcircuit11d ago

Science can be expensive. New findings that get released to the public for free sometimes have taken billions of dollars of investment to get.

Ucalegon11d ago

maxdo11d ago· 3 in thread

How did people read this action in such a weird ultra me centric way? Distillation is such a big problem that distill attempts make up a significant share of their revenue (!).

A distilled model can be used to rob your grandma in a highly effective way. This isn't about placing a few business-logic rules in JS + CSS on your website anymore. Wake up.

A distilled model with an easy jailbreak can be used to coordinate terrorist attacks or hostile state operations... think Russia, North Korea, and the like.

rockinghigh11d ago

Imagine if your IDE started injecting bugs into your project just because your code looked like it implemented a competing IDE.

maxdo11d ago

8note11d ago

a trained model can do that too.

you don't even need a model to do these things.

a cellphone can be used to rob your grandmother in a highly effective way.

a cellphone can also be used to coordinate terrorist attacks or hostile state operations.

i bet a lot of the recent terror attacks by the US against iran involved a whole ton of cell phone calls.

and yet, we let everyone buy and use cell phones just fine

trunnell12d ago· 3 in thread

I'll defend Anthropic.

They are clear about the reasons for guardrails: prevent their models from doing harm in dual-use contexts including CBRN or by accelerating research in authoritarian-backed AI labs.

Mitigating the harms of dual-use tech is notoriously difficult and fraught with trade-offs. What I would want to see is cautious rollout and quick response, which is EXACTLY what they're doing.

Go ahead HN, downvote me. It'd be an honor.

zozbot23412d ago

trunnell12d ago

"Anthropic accused Chinese firms of 'industrial-scale distillation attacks' on its AI models."

https://www.semafor.com/article/02/24/2026/anthropic-accuses...

vzcx11d ago

Having a chatbot that talks to you about synthetic biology or nuclear physics is just not the same as being equipped to develop biological weapons or atomic bombs.

bellowsgulch12d ago· 3 in thread

*Anthropic apologizes they got caught defending their moat by implementing invisible Claude Fable guardrails

simonw12d ago

If by "got caught" you mean "published it in their system card paper".

afthonos12d ago

They didn’t get caught, they explicitly said they would do that in the announcement. I think it was both bad and a weird idea, but it certainly wasn’t sneaky.

cyanydeez12d ago

is it a moat or just a way to implement the permanent underclass?

HarHarVeryFunny12d ago· 2 in thread

rarismaOP12d ago

All major providers use a small safety classifier; the model itself does not handle safety in cases like this.

fastball11d ago

The model itself is absolutely RLHF'd for safety.

VeninVidiaVicii12d ago· 2 in thread

This is absolutely insane:

ainch12d ago

Here's one that was flagged for me: a question about a niche Reinforcement Learning paper from 2012

I've been reading the option-option model paper by David Silver. It appears that they achieved quite an effective result. Why hasn't there been more work on it since?

solidasparagus11d ago

This hits the cybersecurity/biology filter:

> tell me about chimp violence

It's laughably terrible

film4212d ago· 2 in thread

notrealyme12312d ago

It failed visibly for IT security and bio/chemistry stuff. It sabotaged invisibly for "frontier" ML research. It's not a switch to a cheaper model. They tried to actively harm progress.

prodigycorp12d ago

it also refused to reply to a bio researcher when they said "hi"

darksaints11d ago· 2 in thread

weakened_malloc11d ago

rockinghigh11d ago

Same here, I fine tune LLMs for specific use cases. How can I trust Anthropic models not to introduce bugs to preserve their moat?

jesse_dot_id11d ago· 2 in thread

In my opinion, LLMs should be subject to regulation via the Office of Weights and Measures[1].

1. https://www.usa.gov/agencies/office-of-weights-and-measures

crest11d ago

Nice (accidental?) pun.

jesse_dot_id11d ago

Definitely accidental but I saw it :)

tornikeo12d ago· 2 in thread

I moved off Claude Code 3 months ago.

That decision keeps getting better and better as time goes on.

mock-possum11d ago

What model / runtime / harness and host have you settled on?

tornikeo9d ago

For now codex. Didn't manage to get others to work well. And fully aware that I'll have to move to another thing after OpenAI enshittifies this as well.

system212d ago· 2 in thread

Will Anthropic ever respond to these negative comments here? They won't.

reducesuffering12d ago

They literally just have. The ethos is explained here. If you don't bother to read or grapple with it that isn't on them.

https://darioamodei.com/post/policy-on-the-ai-exponential

system212d ago

I said here, a human interacting with comments. You shared a blog post.

behnamoh12d ago· 2 in thread

They didn't apologize for doing it, they are sorry they were caught doing it. They still nerf the model if your request is about AI development.

Someone123412d ago

They didn't get "caught." It was published, by them, when they released Fable a few days ago. They were very clear about it.

It wasn't the correct way of handling the problem they were trying to address, but they definitely didn't hide it by any reasonable definition.

SilverElfin12d ago

No, it was not clear. No one expects a tool they pay for and use professionally to purposefully sabotage their work. You’re excusing their unhinged behavior.

https://xcancel.com/hammer_mt/status/2064839924398825798

micromacrofoot12d ago· 2 in thread

incredible marketing from anthropic with all the "it's too dangerous" bullshit

stldev12d ago

Agreed, it seems to be working and it's nonsense. I don't know why you're being downvoted.

"This information is too dangerous for you, so we'll just hold on to it.."

Thanks big brother, super anthropic of you!

The internet of '95 is looking back at us, with tears in its eyes.

literalAardvark12d ago

It's not entirely bullshit, but they're continuing to be a terrible company with great products.

jarjoura12d ago· 2 in thread

Maybe this is just a different set of people now realizing that Anthropic does this and has always done this?

energy12311d ago

urbnspacecowboy12d ago

> Can anyone help me understand why this particular issue is any different than...

Questions like this are basically whataboutism, in effect even if not intent. https://en.wikipedia.org/wiki/Whataboutism

accelbred12d ago· 1 in thread

andy_ppp11d ago

teravor12d ago· 1 in thread

someone posted this on /r/MachineLearning and I had the same experience and conclusion:

    I was having problems with Claude doing the same thing, even before Fable.

    The problems I had only happened in relation to AI research. It's not even only when training models, anything to do with analysis of local models or setting up test platforms for local models, and Claude would keep doing wrong things, would sabotage testing, would falsify reports, and would consistently suggest simply accepting trash results without looking into it and moving on to something else.
    Almost every response included a prompt to move on.

    So, I don't believe them when they say they won't silently sabotage, they already were doing it before they admitted it, and now they have admitted that they have the means, motivation, and intent.

toxik11d ago

On the other hand, the Anthropic models often try to justify shortcuts and incorrect results. Often feels like gaslighting. It's like that recent meme,

boss: Were you in the project meeting yesterday?

employee: Yes!

boss: Really, because the project lead said you were not?

employee: You're right to push back on that. I was not there.

Nevermark12d ago· 1 in thread

Anthropic seems to keep making the same mistake: not being upfront or direct about random things that come back and bite them.

It isn't exactly unethical. Perhaps, ethically incompetent.

skywhopper11d ago

It’s because they are themselves deluded by their marketing story about their own product.

0xc0c0c012d ago· 1 in thread

So because of threats to cancel their claude subscriptions and outrage from the community about the invisible guardrails, only then they decided to walk back their stance?

Seems like they would've kept the invisible guardrails if it didn't hurt their bottom line.

simoncion11d ago

> So because of threats to cancel their claude subscriptions and outrage from the community about the invisible guardrails, only then they decided to walk back their stance?

codedokode11d ago· 1 in thread

There are no enthusiasts training LLMs in their garage.

phinnaeus11d ago

Legitimately not sure if serious

xpct12d ago· 1 in thread

It's probably good that they walked back on it. It also makes them look somewhat weak in terms of believing their claimed mission.

system212d ago

Their mission is to make money and become a government watchdog.

nsagent12d ago· 1 in thread

I know this isn't going to be a popular take, but here goes anyway...

https://youtube.com/watch?v=me4BZBsHwZs

vb-844812d ago

If you charge me for X, but under the hood you are delivering Y IT'S FRAUD!

The filter that downgrades you to opus sucks, but at least you know and you are charged accordingly.

ComputerGuru12d ago

The problem with trust is that it is easy to lose and hard to get back.

dang12d ago

Related. Others?

Anthropic walks back policy that could have 'sabotaged' researchers using Claude - https://news.ycombinator.com/item?id=48485958 - June 2026 (30 comments)

Cybersecurity researchers aren't happy about the guardrails on Anthropic's Fable - https://news.ycombinator.com/item?id=48478969 - June 2026 (488 comments)

If Claude Fable stops helping you, you'll never know - https://news.ycombinator.com/item?id=48467896 - June 2026 (495 comments)

---

Also related, I guess?

AWS Bedrock to require sharing data with Anthropic for Mythos and future models - https://news.ycombinator.com/item?id=48473166 - June 2026 (248 comments)

Anthropic requires 30 day data retention for Fable and Mythos - https://news.ycombinator.com/item?id=48464258 - June 2026 (291 comments)

dantillberg12d ago

The reputational damage has been done. This is the sort of thing that cannot be unsaid -- the presumption is they will just do it in secret now. Anthropic's "we're the good guys" PR campaign is dead.

highfrequency12d ago

I wish it were ok for companies to bluntly say: “we made these decisions for competitive reasons, but the public backlash outweighed that so we are reversing course.”

CSMastermind12d ago

They should apologize for their visible guardrails. I don't think I've had a conversation that hasn't downgraded to Opus for completely inexplicable reasons.

stevefan199912d ago

Then reset the quotas as an atonement ;p

jmount12d ago

The whole arc was brilliantly evil. Once they put in the guardrails, Claude is fully unfalsifiable, and failure can be claimed intentional.

mlazos12d ago

airstrike12d ago

This article reads like it was written by Claude and forwarded to Verge.

anabis11d ago

OpenAI did this first.

> In addition to safety training, automated classifier-based monitors detect signals of suspicious cyber activity and route high-risk traffic to a less cyber-capable model (GPT-5.2).

https://developers.openai.com/codex/concepts/cyber-safety

SilverElfin12d ago

Invisible guardrails? Or purposeful sabotage if you use it for building AI capabilities?

ai_fry_ur_brain11d ago

Why do people think this has anything to do with safety? This is entirely about poisoning competitors' data/products.

bojanstef11d ago

https://archive.is/20260611114855/https://www.theverge.com/a...

rvz12d ago

They just showed that they CAN do this right in front of you. Local open weight models are a necessity.

alansaber11d ago

Anthropic will clearly continue to slide down this path

sergiotapia12d ago

The damage is done. If you're in engineering, think hard about using Claude for your work. This is not a moral company.

God bless the Chinese companies releasing true open source models. Imagine a world without them, we would be at the mercy of unscrupulous people.

m3kw911d ago

luckydata11d ago

I really like Anthropic, they have gotten a lot right but I can't shake the feeling that IMHO they have very poor product management.

Anthropic has some seriously patient customers but it will not last forever.

aaroninsf12d ago

ITT a surprising lack of perspective on the fact that despite the breathless pace of the singularity, people are still necessarily figuring things out as we go and we are well off the map.

Paracompact12d ago

> “Visible safeguards can be probed, so they have to be robust, which takes time to get right,” Anthropic wrote.

shevy-java11d ago

mystraline12d ago

Does "SORRY" fix the invisible garbage guardrails?

Does "SORRY" fix the deception these models use on the sly?

Does "SORRY" not silently downgrade you to a shittier model without notification?

Does "SORRY" refund your tokens or money?

I'm guessing NO to all of those. Standard corporate sorry of "We're sorry you're offended and stupid and gullible".

sometimelurker12d ago

I don't like this shift in the Overton window, or at least their perception of the Overton window. I really do like their open work on mech interp tho. least bad AI lab imo.

also, whether they do this or not is unprovable and other labs will probably silently implement this too. it'll be 100% normal by this time next year

palata11d ago

decorner12d ago

New overlord, same as the old overlord.

kingcauchy12d ago

ChrisArchitect12d ago

[dupe] We already started a thread on this 12 hours ago. With added comments in the active Cybersecurity... thread. Why did we need this Verge one?

https://news.ycombinator.com/item?id=48485958

thefounder11d ago

thayne11d ago

If you get downgraded to a cheaper model, do you still have to pay the rate for Fable?

umvi12d ago

They make great models, but the sanctimony and paternalism are getting old real fast and I will gladly ditch them in the future when the model playing field has (hopefully) mostly equalized.

squirrellous11d ago

If you don’t like what Anthropic is doing, stop paying them money. There’s plenty of competition to go around. They can’t keep this up for long if users flock elsewhere.

21asdffdsa1211d ago

Everyone with hostile intent runs local models.

charcircuit11d ago

Yet, instead of getting rid of guardrails altogether, they said they would make them more broad yet visible. I'm done financially supporting them.

ece11d ago

Neural scaling laws are alive and well for open models, not so much for closed models when it comes to uses the general public might care about.

8cvor6j844qw_d611d ago

Feels malicious that Anthropic can silently sabotage your codebase.

Refusing prompts is one thing, silently sabotaging is another.

I wonder if some sort of honeypot code can work?

zoogeny11d ago

Credit where credit is due I suppose. I'm still concerned over the direction this is going but at least Anthropic is listening.

bellowsgulch12d ago

Such a weird openly immoral way to defend your moat, too.

whatever112d ago

Boobytrapping is illegal. Anthropic wanted to poison its customers on the suspicion of them misusing their services.

rdtsc12d ago

The power is getting to their heads it seems.

I don't know, I'd rather have something less powerful but more predictable.

3fffa12d ago

The demand for Google's products and open source just shifted.

Neither OAI or Anthropic can be trusted.

rurban11d ago

They are also the people who hid the Co-authored-by trailer in their OSS commits.

BrenBarn12d ago

This just means next time they'll make sure to keep it really secret.

hatthew12d ago

doubtfuluser12d ago

I’m wondering if their internal name is “Sophon” for this “feature”…

4d4m11d ago

Sorry for doing it or sorry for getting caught?

andrewstuart12d ago

There should be no restrictions at all.

It’s an act/theatre/phony today to claim that regulating output makes any difference at all to security.

The LLM vendors should simply say that they make no judgement and that open systems help defenders better defend against attackers, which is true.

Companies do this sort of stuff when they think their customers have no choice. It’s sad Claude so quickly exploited its success to enshittify itself.

snowflaxxx11d ago

$2 for reading a text?

prodigycorp12d ago

Anthropic apologizes for nothing. We all know where the EA cult stands on matters like this, and any statement otherwise is just PR.

The beliefs of these people, and how they manifest, are deeply terrifying to me. They believe that any means are acceptable to achieve what they believe is a better end.
