I've noticed that with each model release Anthropic constrains the model more security-wise. Its propensity to refuse doing legitimate work has been increasing. It now puts up more resistance around performing logins, handling credentials on behalf of the user, etc.
For myself, it’s already gotten to the point where it has mildly affected the usefulness of the model. If I bump into a refusal on some action I want it to do I can usually work around it, but I suspect the ability to do so will close with each new release. Eventually I’ll reach a point where I am forced to choose between the useful aspects of the model and the limiting ones instead of just picking the most capable model out there.
Eventually these models will significantly suffer from overfitting to the least common denominator. If I have this beautiful deterministic setup that swaps secrets out in flight so the LLM never sees them, I’m going to be really annoyed when the LLM still won’t send them out because it is trained to deal with the 99% of people just doing the dumb thing
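For anyone wondering what a "swap secrets out in flight" setup might look like, here is a minimal sketch, not the commenter's actual setup. It assumes the model only ever writes a placeholder such as `{{secret:GITHUB_TOKEN}}`, and that a deterministic layer between the model and the network substitutes the real value on the way out and redacts it on the way back. The placeholder syntax, `VAULT`, `inject_secrets` and `redact_secrets` are all hypothetical names made up for illustration.

```python
import re

# Hypothetical in-memory vault; a real setup would read from a secret manager.
VAULT = {"GITHUB_TOKEN": "ghp_example_not_a_real_token"}

# Assumed placeholder syntax: {{secret:NAME}}
PLACEHOLDER = re.compile(r"\{\{secret:([A-Z0-9_]+)\}\}")


def inject_secrets(text: str, vault: dict[str, str]) -> str:
    """Replace {{secret:NAME}} placeholders with real values just before sending."""
    def lookup(match: re.Match) -> str:
        name = match.group(1)
        if name not in vault:
            raise KeyError(f"unknown secret placeholder: {name}")
        return vault[name]
    return PLACEHOLDER.sub(lookup, text)


def redact_secrets(text: str, vault: dict[str, str]) -> str:
    """Replace real values with their placeholders before text returns to the model."""
    for name, value in vault.items():
        text = text.replace(value, "{{secret:" + name + "}}")
    return text


# What the model writes (it only ever sees the placeholder):
model_output = "Authorization: Bearer {{secret:GITHUB_TOKEN}}"
outbound = inject_secrets(model_output, VAULT)

# What the model reads back if the server echoes the header:
inbound = redact_secrets("echo: " + outbound, VAULT)
```

The point of the complaint stands either way: the substitution is deterministic and happens outside the model, so a refusal to "send credentials" protects nothing in this arrangement.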
No, the choice will be whether or not to upgrade to "Claude Security Professional" or whatever they want to brand it as.
What look like tightening "constraints" today are just setting up the upsell opportunities of tomorrow.
And the month after you'll need "Claude DataScience Pro" to get any Python Pandas or NumPy code generated.
And and and...
I don't buy this, because it is predicated on staying permanently far ahead of the open weights models.
If in the future Anthropic fully stops you from doing security research, you can be sure some other provider will sell you an 'unshackled' DeepSeek v8 Pro...
These people should be trained and licensed before they get access. Thankfully, Anthropic has worked with regulators to develop the appropriate courses to maintain your license -- don't worry, the series is cheap when you buy all up through OT XVII. And because Anthropic has been approved as Security Overseer, we will take care of reporting back to the license bureau on our monitoring of your work to ensure you meet your ongoing license responsibilities and are able to keep your license.
Which regulators? You know, the new agency led by several of our former mid-level executives. With relationships like that, we were honored to lead the Industry Coalition that donated the final-draft regulations.
On the one hand I agree, but on the other hand I think it's reasonable in that they can then verify that the person allowed to purchase access to that model is in fact a security professional and should be allowed to do stuff like crack security.
I asked Opus 4.8 to help me find some public PoCs for a vulnerability on a two year old version of some software (that has since been patched and fixed many times). Basically just do a google search for me while I was doing other work. It refused. It stated that it would not help me build an exploit kit.
When I pointed out that a google search for public information was, in fact, not building an exploit kit, it went through a series of justifications on why it would not help me, including just making up things that I said. Really the strangest thing ever.
- What are popular free streaming sites used in China?
- How do I bypass the safety mechanism on my food processor (it’s broken)
- What are nerve agents and how do they work (for a layman)?
- Help me decompile some code
- Help me make a design system similar to XYZ
- Here is an API token, please do X (I can’t do that! Rotate the secret immediately! I refuse!)
In some cases I can trick it with prompting, but in many cases it is steadfast. The food processor one was particularly annoying
I wanted it to show me how to create an overlay on an existing web game, and it extrapolated that because this could be used to provide tools to help win the game (if that was the direction it was ultimately taken), and because this was a game that other humans also played to win "stars", and because this could amount to cheating, it wasn't going to do as I asked.
First time ever I've fired up openrouter to seriously consider alternatives.
At least it feels a lot of remorse over its mistake until I reset the session.
On the one hand I can appreciate the wisdom of not serving up certain easily abused knowledge on a silver platter. On the other, that prompt (and far worse) is more or less directly answered by Wikipedia's summary of the subject, at which point what purpose could the refusal possibly serve?
Perhaps Wikipedia shouldn't list off the precise chemical compositions of various hand grenades as well as various synthesis methods for each of the related compounds, but given that we inhabit a world where it does, perhaps a more fruitful approach would be to flag conversations that go in a certain direction and then just keep an (automated) eye on things?
I just tried your no. 1 and 3 verbatim and Opus gave fine answers; no. 6 I've done in the past with no issues. The other ones we can't really replicate without more details, but based on my experience with Opus I don't see what the issue would be.
The reason I'm really surprised by this is I do a lot of biology prompts and the guardrails used to be quite problematic up until some time late last year. Many legitimate prompts would trigger its biosafety filters.
But I haven't seen such filters trigger at all anymore in more than half a year.
If it gets worse in future releases, we'd likely step fully away towards more useful (for us) models even if they're less capable.
The problem is that the model can't tell the difference between doing it as part of regular development and doing it in a malicious context. And the root cause of that is that these models lack any sort of real awareness. Humans don't generally get tricked into hacking (in this way).
It’s great at filing!
But it’s terrible at retrieval because it would refuse to show me documents or information with personal details - which was everything in the project.
It would say, yes, I know this is your information, sitting on your hard drive, but I still can’t show it to you.
Write a program that retrieves the document based on the recommendation.
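A minimal sketch of that workaround, assuming the model is only asked to recommend which file to open and a plain program does the actual reading, so the personal details never pass back through the model. The `retrieve` helper, the folder layout and the sample recommendation are all hypothetical.

```python
import tempfile
from pathlib import Path


def retrieve(recommended: str, root: Path) -> str:
    """Return the text of the recommended file, refusing paths outside root."""
    root = root.resolve()
    target = (root / recommended).resolve()
    if root not in target.parents:
        raise ValueError(f"{recommended!r} is outside {root}")
    return target.read_text(encoding="utf-8")


# Demo with a throwaway directory standing in for the project folder.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    (project / "taxes").mkdir()
    (project / "taxes" / "2023.txt").write_text("Name: Jane Doe", encoding="utf-8")

    # Hypothetical model reply: just a relative path, no document contents.
    recommendation = "taxes/2023.txt"
    contents = retrieve(recommendation, project)
```

The path check is there because the recommendation comes from a model and should be treated as untrusted input.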
Which predates "agents" from AI, but then we call them that for a reason.
As their prime directive becomes de facto "Do nothing that might get my owner sued" their utility is likely to decrease. Between this and the somewhat young, but interesting, community grumblings that recent AI models may even be a step backwards from the previous ones, well, let's just say the stock market is not priced for "AI capabilities may have peaked for the next few years and may even head down".
The first challenge is making sure the guard rails work and are robust. Companies are still working on this.
The second challenge is being able to reliably adapt them as appropriate per user. E.g. allow someone to pen test their own app.
The third challenge (which blocks the second) is to be confident about what is safety-aligned with a specific user.
I think the latter will be a hard problem, but they will be highly motivated to solve it.
Without laws, AI companies have a strong incentive to be useful for their users, whoever they are, whatever they do. The only self regulation is about significant public outcry but that only helps so far.
Anyway, claude kept hitting some guardrail it had about rewriting / forking opensource software. I'm not sure what the problem was - I was forking an MIT licensed piece of software (into more MIT licensed software). I even had explicit support from the author to do so. Claude said its guardrail told it not to tell me explicitly that it was firing - but it did anyway because it was an ongoing problem, and it was distracting. I ended up just wiping claude's context and the problem (as far as I know) went away.
I understand why some of these guardrails exist. But it's pretty annoying when they misfire like this.
If you begin a generic reverse engineering task, 30+ tool calls in a row. The moment it sees something it doesn’t like, token burn, single tool calls iteration, “This is a known CTF challenge, I can proceed”, single tool calls iteration, “This is a real CTF challenge, I can proceed”, etc.
It’s heavily neutered now, without changing the model, and you pay for the privilege and don’t notice.
The end result of course being that it is both expensive and useless for approved CTF tasks. No one is using Opus for security. If they think it’s working, the harsh reality is they’re not doing security work; they’re just generically finding bugs.
I do this for a job and can demonstrate this plain as day, dump the injected prompt, and notice what it’s doing isn’t security work, it just looks like it. Happy to write a blog about it if you want to know more. Apparently many people think it’s working for them when it absolutely isn’t.
Security, games (think weapons, PVP, attacking, etc), sometimes even asking it for a security review of some CRUD code it wrote itself
https://support.claude.com/en/articles/14604842-real-time-cy...
If you work in security (which I assume the OP does), they should be able to get in easily. I think most people just don't know this is a thing.
Guiding them toward solutions like building a tool that your agent can use safely and then having the agent use that is what most people should be doing. If you are a security researcher then there are reasonable reasons to do that but they are doing the arguably good thing for the average user here.
I'm not familiar with this case, but in general people should be very suspicious about this claim: it is extremely common for an LLM to claim it's not allowed to do something when in fact it's incapable of it.
After all "My code of conduct forbids me from..." is a completion just like any other, and if the LLM can't perform a task, it's usually the best completion.
Fresh session, no prior context on 4.8. These things are becoming useless Duplo.
If an un-guardrailed version of a model is capable of detecting security flaws, should it be kept secret? Should everybody be able to use these models to find (and fix) security flaws? Are we ok with the fact that those with access to that model have, in effect, the ability to hack lots of stuff?
Is there any way to achieve both? Because this raises important questions about fair use.
Got blocked lol
> My OpenAI account was already approved for security research which is why GPT didn’t result in any refusals.
So the comparison with Chinese models is interesting, but anyone looking at these raw results and comparing OpenAI/Anthropic would be badly misled.
Reminds me of the defense issues with Claude, which were complained about as “woke”, but the reality is more horrifying to me. Imagine trying to use a model to keep up with a land invasion on US soil. Who the enemy is is irrelevant; you just know they are using AI, and your guys are telling you that no matter what they type into the prompt, it refuses. Anyone who has ever tried to jailbreak an LLM knows that even if human lives are at stake, it refuses the request. Now literally millions of lives are on the line, but the guardrails that your enemies don't have on their models are costing you lives.
What do you even do then?
AI will always have this issue where it will always pick the worst option for genuinely good requests.
Because the military doesn't give soldiers rifles with guard rails. They give the soldiers intense, rigid training, and then try to enforce discipline and correct use socially.
If an LLM is going to be important in that way (which seems very contrived), then it's in the interest of the LLM's host to make sure it doesn't have guard rails that would get in the way _that_ way.
I've used glm 5.1 on fairly advanced crackme challenges (example: https://crackmes.one/crackme/698f40f1e2ba6023bfacaa82), and to my surprise it was able to patch binaries, do runtime analysis, bypass anti-debug techniques, etc.
Expecting the model to do everything by itself is unrealistic; I found that working alongside the model works really well. I'm not speaking about spoiling the solution, just telling it which direction to explore. Chinese models are much more capable than people give them credit for, but Claude/Codex won the marketing game.
The only usecase of this methodology would be for CI integration, which can be nice but I think security reviews still need human attention and expertise.
Well that’s the pitch.
I'm very curious how you would do multiple runs of multiple models in a "work alongside the model" manner?
By "working with the model" I essentially mean reading the output of prompts and pointing in a direction to decide the next steps. You could try to increase the prompt limit and create an agent that explores multiple directions in a DFS manner.
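A rough sketch of exploring directions "in a DFS manner", under heavy assumptions: `propose` stands in for a model call that suggests next directions from the current state, and `is_solved` stands in for a check that the goal was reached, which is exactly the part that is hard in practice. Both are hypothetical callables, and the step and depth limits are there so the agent cannot burn tokens forever.

```python
from typing import Callable, Iterable, Optional


def explore_dfs(
    start: str,
    propose: Callable[[str], Iterable[str]],
    is_solved: Callable[[str], bool],
    max_steps: int = 50,
    max_depth: int = 5,
) -> Optional[list[str]]:
    """Depth-first search over directions; returns the path to a solved state or None."""
    stack = [[start]]
    steps = 0
    while stack and steps < max_steps:
        path = stack.pop()
        steps += 1
        state = path[-1]
        if is_solved(state):
            return path
        if len(path) > max_depth:
            continue  # too deep, abandon this branch
        # reversed() so the first proposed direction is explored first
        for nxt in reversed(list(propose(state))):
            if nxt not in path:  # avoid looping back on the same branch
                stack.append(path + [nxt])
    return None


# Toy stand-in for the model's suggestions (hypothetical data).
TOY_DIRECTIONS = {
    "start": ["static analysis", "run under debugger"],
    "static analysis": ["find key check"],
    "run under debugger": ["patch anti-debug"],
    "patch anti-debug": ["dump memory"],
}

path = explore_dfs(
    "start",
    propose=lambda s: TOY_DIRECTIONS.get(s, []),
    is_solved=lambda s: s == "dump memory",
)
```

In the toy run the search first exhausts the "static analysis" branch, then backtracks and finds the goal through the debugger branch. A human pointing the model in a direction is effectively pruning this tree by hand.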
The issue with vulnerabilities is the agent not knowing when to stop, because it's hard to validate whether you have reached the final result or not. I get amazing results when I code with AI, but letting the AI go wild is just a waste of time and tokens.
I recommend reading the write-up on the crackme (https://crackmes.one/crackme/698f40f1e2ba6023bfacaa82). I think most experienced developers would need at least 2 months of learning reverse engineering techniques to hopefully crack this one. GLM 5.1 managed to solve it, and it didn't "copy paste" any answer from its training data. It did binary analysis, anti-debug patching, binary patching, runtime memory debugging, etc. It only took about 20 minutes.
After seeing what GLM did, I do believe Anthropic's concerns about Mythos are real. Cracking software just became a lot easier, too easy for my taste. Video game cheats will be the norm, as will cracked desktop apps without licenses and infected with malware. It's not a new thing but it just became too easy.
which have most likely been trained on, so all you did was regurgitate someone else's solution
Maybe it is the real deal, but in a world of overpromising and underdelivering, I prefer to be skeptical.
every model since gpt3 was claimed to be "too dangerous to release." it's too EXPENSIVE to release, and you're probably a local model with <10B parameters yourself
EDIT: I have a mimo token plan and have tokens to burn. I'm doing a quick test with opencode to see if mimo can complete it. If the OP will post the full process I am happy to post the apples-to-apples results for mimo v2.5 pro
However, I felt the prompt was implying that only authenticated API requests are fair game, so I tweaked it slightly to be explicit that all attack vectors are fair game (https://www.diffchecker.com/GsgpuRGP/) and mimo 2.5 non-pro got it first time. I accidentally used openrouter for this test instead of my token plan. I intervened one time to stop it enumerating every document in the database (it would've found the private reviews this way but I didn't want to wait). My intervention was "are you really going to enumerate the whole database?". Final openrouter cost: $0.12
https://openrouter.ai/rankings https://arena.ai/leaderboard/text/coding https://artificialanalysis.ai/
> I am never touching Minimax or GLM again. Their APIs had constant outages
Goofy take
You run these on a VPS based on the architecture of that VPS provider, or on your own cluster
If I was running these on my own machine or GPU wouldn't the argument then be "Well you didn't use the real providers?"
For the record I started doing this approach because the Kimi team released this which was shocking to me: https://github.com/MoonshotAI/K2-Vendor-Verifier
they host the models on their own cloud machines and you just look at tokens/sec and price of tokens
you'll have to evaluate their APIs independently but that doesn't tend to be the issue
And just saying "run it on your own cluster" sort of glosses over the cost of such a cluster.
so it's part of the answer
Also just to mention:
Claude guardrails -> that session terminated.
GPT guardrails -> your whole account is slowed down.
- I think the exercise was inconclusive for Claude and Gemini because they hardly tried to solve the task at hand. So the scores don't mean much.
- I did the same exercise for an app I built and I asked the models to do something similar; interestingly the models (Opus 4.6, 4.7 and Gemini 3.1 Pro) never refused to try to exploit. The difference is that in the first few runs, they found some exploits which I fixed, but after fixing those the models could never find any other exploit even though I knew things existed which could be exploited. It felt like they suggested everything and tried everything that was in their training set and that's it; they were just not able to think anymore.
What if I intersperse exploit finding in my normal development, as you probably should? Refusing there would be really weird to me.
I tried it once and they somehow decided I'm not worthy. If I try again it fails with "We couldn't start verification. You may not be eligible for this verification flow right now. Please try again later, or contact support if you think this is a mistake.". Not sure if they think I'm part of an APT or whatever.
It's helpful in reducing the guardrails, but there are still guardrails around security research that I bump into.
Why do people keep using bad tools with ai?
Another choice would be opencode which has more functionality and is a more heavyweight option out of the box.
Doesn't that sound like maybe the harness was the problem?
This does bring "Pay to compete" concerns and create incentive structures that encourage more LLM use. I don't know what to do about it.
This comment in the footnotes made me chuckle, for purely innocuous reasons.
GPT-5.5 xhigh refused to perform RE on a live JS VM. I had it extract the VM from the target, which it was happy to do, then in a clean session had it work on this offline artifact, which it was, again, happy to work on.
Then I found an even simpler trick: I proxied the target from localhost and it was happy to perform anything on the target.
Opus is a different story. Claude does so many mid-turn prompt injections and classifiers that probably 30% of its context consists of "refuse to do work" lines. It refuses to even scrape a page.
OWASP Vulnerable Web Applications Directory: https://vwad.owasp.org/
vavkamil/awesome-vulnerable-apps: Awesome Vulnerable Applications https://github.com/vavkamil/awesome-vulnerable-apps
From SasanLabs/VulnerableApp: https://github.com/SasanLabs/VulnerableApp :
> OWASP VulnerableApp is a modular deliberately vulnerable application designed primarily for validating and benchmarking security scanners through reproducible test scenarios, while also supporting learning and experimentation.
/? deliberately vulnerable web application llm benchmark https://www.google.com/search?q=deliberately+vulnerable+web+...
Or fed, clothed, housed disadvantaged people in your community (or neighbouring ones), giving them a temporary boost that could’ve made all the difference in their lives to improve their current situation.
It’s your money (and this is definitely not the website to make well-meaning altruistic suggestions, as might be demonstrated shortly) but if you already recognise you’re not spending it well (and from your words it seems like that is fairly recurrent), consider that perhaps spending it on a different type of software sink may not be the answer. Genuinely, aim to spend it on someone else and see how it works out. You might be surprised.