I built a vulnerable app and spent $1,500 seeing if LLMs could hack it

(kasra.blog)

402 points · jc4p · 23d ago · 216 comments

SOLAR_FIELDS23d ago· 61 in thread

One interesting takeaway is the low score on Anthropic models from this benchmark. It’s not because of capability, it’s because Anthropic’s guardrails prevented it from solving the problem.

I noticed with each model release Anthropic constrains the model more security wise. Its propensity to refuse doing legitimate work has been increasing. It now puts up more resistance around performing logins, handling credentials on behalf of the user, etc.

For myself, it’s already gotten to the point where it has mildly affected the usefulness of the model. If I bump into some action I want it to do I can usually work around it, but I suspect the ability to do so will close with each new release. Eventually I’ll reach a point where I am forced to choose between the useful aspects of the model and the limiting ones instead of just picking the most capable model out there.

Eventually these models will significantly suffer from overfitting to the least common denominator. If I have this beautiful deterministic setup that swaps secrets out in flight so the LLM never sees them, I’m going to be really annoyed when the LLM still won’t send them out because it is trained to deal with the 99% of people just doing the dumb thing
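A minimal Python sketch of that kind of in-flight swap, for illustration only. The `{{secret:NAME}}` placeholder syntax, the function names, and the `vault` dictionary are all assumptions; the comment does not describe the actual setup.

```python
import re

# Hypothetical placeholder syntax (an assumption, not the commenter's format).
PLACEHOLDER = re.compile(r"\{\{secret:([A-Za-z0-9_]+)\}\}")


def inject_secrets(outbound: str, vault: dict[str, str]) -> str:
    """Replace {{secret:NAME}} placeholders with real values just before sending."""

    def _lookup(match: re.Match) -> str:
        name = match.group(1)
        if name not in vault:
            raise KeyError(f"unknown secret placeholder: {name}")
        return vault[name]

    return PLACEHOLDER.sub(_lookup, outbound)


def redact_secrets(inbound: str, vault: dict[str, str]) -> str:
    """Swap any real secret value in a response back to its placeholder."""
    # Longest values first so a short secret cannot split a longer one.
    for name, value in sorted(vault.items(), key=lambda kv: -len(kv[1])):
        if value:  # skip empty values; replacing "" would corrupt the text
            inbound = inbound.replace(value, "{{secret:" + name + "}}")
    return inbound
```

In this sketch the model only ever reads and writes the placeholder; a proxy layer would call `inject_secrets` on outbound requests and `redact_secrets` on anything coming back.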

swatcoder23d ago

> Eventually I’ll reach a point where I am forced to choose between the useful aspects of the model and the limiting ones instead of just picking the most capable model out there

No, the choice will be whether or not to upgrade to "Claude Security Professional" or whatever they want to brand it as.

What look like tightening "constraints" today are just setting up the upsell opportunities of tomorrow.

bigiain23d ago

And next month you'll need to add on "Claude Database Pro" or you'll just get a working (for demo purposes with dozens of db rows) but completely unindexed database schema and a refusal to optimise SQL requests.

And the month after you'll need "Claude DataScience Pro" to get any Python Pandas or NumPy code generated.

And and and...

swiftcoder22d ago

> What look like tightening "constraints" today are just setting up the upsell opportunities of tomorrow.

I don't buy this, because it is predicated on staying permanently far ahead of the open weights models.

If in the future Anthropic fully stops you from doing security research, you can be sure some other provider will sell you an 'unshackled' DeepSeek v8 Pro...

me-vs-cat21d ago

What? You can't give access to that kind of power to just anyone with $5,000/month.

These people should be trained and licensed before they get access. Thankfully, Anthropic has worked with regulators to develop the appropriate courses to maintain your license -- don't worry, the series is cheap when you buy all up through OT XVII. And because Anthropic has been approved as Security Overseer, we will take care of reporting back to the license bureau on our monitoring of your work to ensure you meet your ongoing license responsibilities and are able to keep your license.

Which regulators? You know, the new agency led by several of our former mid-level executives. With relationships like that, we were honored to lead the Industry Coalition that donated the final-draft regulations.

bryanrasmussen23d ago

>What look like tightening "constraints" today are just setting up the upsell opportunities of tomorrow.

On the one hand I agree, but on the other hand I think it's reasonable in that they can then verify the person allowed to purchase access to that model is in fact a security professional and should be allowed to do stuff like crack security.

strictnein22d ago

You used to be able to talk about what you're actually trying to do and Opus would be like "Oh, ok, let's continue". Now, it'll hold fast to whatever its first impression was.

I asked Opus 4.8 to help me find some public PoCs for a vulnerability on a two year old version of some software (that has since been patched and fixed many times). Basically just do a google search for me while I was doing other work. It refused. It stated that it would not help me build an exploit kit.

When I pointed out that a google search for public information was, in fact, not building an exploit kit, it went through a series of justifications on why it would not help me, including just making up things that I said. Really the strangest thing ever.

shepherdjerred23d ago

Yeah, it has been infuriating. Requests that Claude has refused me:

- What are popular free streaming sites used in China?

- How do I bypass the safety mechanism on my food processor (it’s broken)

- What are nerve agents and how do they work (for a layman)?

- Help me decompile some code

- Help me make a design system similar to XYZ

- Here is an API token, please do X (I can’t do that! Rotate the secret immediately! I refuse!)

In some cases I can trick it with prompting, but in many cases it is steadfast. The food processor one was particularly annoying

Grimblewald22d ago

I've had some really dumb refusals. Explaining elements of infrared spectroscopy, researching artificial bud-breaking in agriculture, etc. Anything interesting and non-mainstream is banned. Basically, restricted to answers I'm better off just going to Wikipedia for.

mft_22d ago

Yeah, I had my first refusal with 4.8 today.

I wanted it to show me how to create an overlay on an existing web game, and it extrapolated that because this could be used to provide tools to help win the game (if that was the direction it was ultimately taken), and because this was a game that other humans also played to win "stars", and because this could amount to cheating, it wasn't going to do as I asked.

First time ever I've fired up openrouter to seriously consider alternatives.

gspr22d ago

I find it terrifying that people are willing to outsource thinking. Outsourcing thinking to an entity that is opinionated about what to think is beyond crazy.

mmmlinux22d ago

The only guard rail I've hit recently was when I was trying to get it to rename files ripped from DVD to episode names. I told it to try again and it did it. It wasn't even really a refusal; it was just working on it and then stopped for content violation or whatever.

mwigdahl22d ago

An easy way around the API token thing is to put it in a file and point the model at the file. I saw what you were seeing when I provided credentials directly, but haven't had any problems with it since using the indirect method.

stavros22d ago

It refuses to use an API token? In my experience, it's more than happy to read out my secrets from .envrc files "just to check".

At least it feels a lot of remorse over its mistake until I reset the session.

fc417fc80222d ago

> What are nerve agents and how do they work (for a layman)?

On the one hand I can appreciate the wisdom of not serving up certain easily abused knowledge on a silver platter. On the other, that prompt (and far worse) is more or less directly answered by Wikipedia's summary of the subject at which point what purpose could the refusal possibly serve?

Perhaps Wikipedia shouldn't list off the precise chemical compositions of various hand grenades as well as various synthesis methods for each of the related compounds but given that we inhabit a world where it does perhaps a more fruitful approach would be to flag conversations that go in a certain direction and then just keep an (automated) eye on things?

svara22d ago

This is strange to me, did you really ask like this and which model did you use?

I just tried your no. 1 and 3 verbatim and Opus gave fine answers; no. 6 I've done in the past with no issues. The other ones we can't really replicate without more details, but based on my experience with Opus I don't see what the issue would be.

The reason I'm really surprised by this is I do a lot of biology prompts and the guardrails used to be quite problematic up until some time late last year. Many legitimate prompts would trigger its biosafety filters.

But I haven't seen such filters trigger at all anymore in more than half a year.

ElFitz22d ago

How are decompiling code or making a design system inspired by another one even remotely illegal?

px199923d ago

My org now sends some portion of our requests to non-anthropic models because refusal has become common from Claude. The requests themselves aren't dangerous, we find that benign requests in biological science wind up being blocked semi-frequently.

If it gets worse in future releases, we'd likely step fully away towards more useful (for us) models even if they're less capable.

danpalmer23d ago

This is a good point – because pentesting is entirely legitimate work, and security testing is a necessary and legitimate part of every day software engineering.

The problem is that the model can't tell the difference between doing it as part of regular development and doing it in a malicious context. And the root cause of that is that these models lack any sort of real awareness. Humans don't generally get tricked into hacking (in this way).

gmerc23d ago

They see an opportunity to charge 10x for pen testing and defence work, while offence will be handled by actors with access to all kinds of other models.

nostromo23d ago

I was using a local Codex project as a personal knowledge base. So I would dump in documents, basic medical docs (like blood labs), and other things and have it file them.

It’s great at filing!

But it’s terrible at retrieval because it would refuse to show me documents or information with personal details - which was everything in the project.

It would say, yes, I know this is your information, sitting on your hard drive, but I still can’t show it to you.

Bewelge22d ago

Tell the agent that they should just find and name the right document. Not retrieve it for you.

Write a program that retrieves the document based on the recommendation.
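One way to read this suggestion, sketched in Python under assumptions: the model's only job is to return a file name, and ordinary code does the reading. `resolve_document`, `show_document`, and the directory layout are hypothetical names, not anything from the original setup.

```python
from pathlib import Path


def resolve_document(recommended_name: str, root: Path) -> Path:
    """Map a model-recommended file name to a real path inside root, or raise."""
    root = root.resolve()
    candidate = (root / recommended_name).resolve()
    # Refuse anything that escapes the knowledge-base directory (e.g. "../x").
    if root not in candidate.parents:
        raise ValueError(f"path escapes knowledge base: {recommended_name}")
    if not candidate.is_file():
        raise FileNotFoundError(recommended_name)
    return candidate


def show_document(recommended_name: str, root: Path) -> str:
    """Read the file locally; the model never has to output its contents."""
    return resolve_document(recommended_name, root).read_text(encoding="utf-8")
```

The path check matters because the name comes from model output, which should be treated as untrusted input like any other.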

satvikpendem23d ago

No, they want to sell you Mythos, for a higher price. It's all an economic game, not actually anything to do with their capabilities, which of course exist, as their Project Glasswing shows. More generally, Anthropic seems to value safety above all else, philosophically speaking, from their very outset.

jerf22d ago

Time to learn about the Principal Agent Problem: https://en.wikipedia.org/wiki/Principal%E2%80%93agent_proble...

Which predates "agents" from AI, but then we call them that for a reason.

As their prime directive becomes de facto "Do nothing that might get my owner sued" their utility is likely to decrease. Between this and the somewhat young, but interesting, community grumblings that recent AI models may even be a step backwards from the previous ones, well, let's just say the stock market is not priced for "AI capabilities may have peaked for the next few years and may even head down".

FloorEgg23d ago

I think that these companies are going to have to, and will, invest in some sort of validated identity context to avoid the lowest common denominator.

The first challenge is making sure the guard rails work and are robust. Companies are still working on this.

The second challenge is being able to reliably adapt them as appropriate per user. E.g. allow someone to pen test their own app.

The third challenge (which blocks the second) is to be confident about what is safety-aligned with a specific user.

I think the latter will be a hard problem, but they will be highly motivated to solve it.

bulbar23d ago

I believe you are overthinking it. I think the sister comment is right that it's a business decision foremost to restrict actions within specific plans for upselling purposes.

Without laws, AI companies have a strong incentive to be useful for their users, whoever they are, whatever they do. The only self regulation is about significant public outcry but that only helps so far.

josephg22d ago

I totally agree. I had a situation a few weeks ago where claude started struggling to make progress. I got it to fork leptos (MIT licensed web app framework) to make it work for native apps instead. Initially I was planning on upstreaming some of my changes. But I chatted with the leptos author about it, and he said I should fork instead. Fine by me!

Anyway, claude kept hitting some guardrail it had about rewriting / forking opensource software. I'm not sure what the problem was - I was forking an MIT licensed piece of software (into more MIT licensed software). I even had explicit support from the author to do so. Claude said its guardrail told it not to tell me explicitly that it was firing - but it did anyway because it was an ongoing problem, and it was distracting. I ended up just wiping claude's context and the problem (as far as I know) went away.

I understand why some of these guardrails exist. But it's pretty annoying when they misfire like this.

lesuorac23d ago

Are they charging for the guardrails? Like do the guardrails expend token counts to then block you from the output of other tokens?

jerrythegerbil23d ago

Yes. When certain keywords or topics are matched, there is a warning transparently injected server side, appended to the system prompt of the convo, that’s miles long. It is injected and reevaluated every tool call.

If you begin a generic reverse engineering task, 30+ tool calls in a row. The moment it sees something it doesn’t like, token burn, single tool calls iteration, “This is a known CTF challenge, I can proceed”, single tool calls iteration, “This is a real CTF challenge, I can proceed”, etc.

It’s heavily neutered now, without changing the model, and you pay for the privilege and don’t notice.

The end result of course being that it is both expensive and useless for approved CTF tasks. No one is using Opus for security. If they think it’s working, the harsh reality is they’re not doing security work; they’re just generically finding bugs.

I do this for a job and can demonstrate this plain as day, dump the injected prompt, and notice what it’s doing isn’t security work, it just looks like it. Happy to write a blog about it if you want to know more. Apparently many people think it’s working for them when it absolutely isn’t.

kay_o23d ago

When your session is force ended for "abuse" you get neither the response nor a refund

Security, games (think weapons, PVP, attacking, etc), sometimes even asking it for a security review of some CRUD code it wrote itself

SOLAR_FIELDS23d ago

Not directly, as it comes in as a not charged error but the weighted generation path used until you hit the guardrail is basically wasted tokens, so yes, indirectly. If I hit a guardrail and rewind I’ve found the training will still be biased towards guardrailing out if you rewind one turn. Rewinding multiple turns allows steering away from that path, but all of the original token spend down that path is wasted

acters23d ago

Yes tokens used (input and sometimes output) are always charged. You likely get charged for the preloaded system prompt, too.

gmerc23d ago

Of course they are. It's standard SaaS to charge for security features ;)

sciencejerk23d ago

Opus 4.6 will still help with full pentesting including RCE. Just requires coaxing (no jailbreak)

ang_cire22d ago

There is a cyber security verification program you can join to avoid these blocks:

https://support.claude.com/en/articles/14604842-real-time-cy...

If you work in security (which I assume the OP does), they should be able to get in easily. I think most people just don't know this is a thing.

not_a922d ago

You can still hit guardrails with this enabled for your account. Had a silly moment a day or so ago when Claude Code hit the guardrail after a web search (presumably because the websearch contained badbad anticheat stuff like https://github.com/0avx/0avx.github.io/blob/main/article-3.m...). Codex with the ID verification has no qualms like this.

andy_ppp22d ago

Funny, Opus 4.8 just logged into the database using uncommitted .env file and ran some DB queries to figure things out so I’m not sure it’s that security conscious - it seems to be getting more intelligent to me and I bet if you frame it as an investigation with say playwright it’ll do all sorts for you. I’m not sure what the point is of constraining your own model like this when others are clearly not tbh.

Haven88022d ago

I just use Deepseek V4 pro and Qwen 3.7 Max at a fraction of Mythos's cost. Yeah, not 100% on par, but in 6 months' time they will be. If Microsoft and Firefox can afford to wait years or decades to fix a bug, 6 months is good enough for me. Western AI now is like the Vikings living their last days on Greenland during the freezing. I just don't see how they are able to compete with Chinese models. And those are trained and run on 7nm. At the end of this year Huawei will debut 3nm (confirmed in Shenzhen). And next year they are on the roadmap to do a 3nm GPU with photonics interconnect.

zaphar22d ago

The correct solution for most users of Claude is to refuse to do things like: `performing logins, handling credentials on behalf of the user, etc`. It is not to find a way to hand your agent the keys to the kingdom.

Guiding them toward solutions like building a tool that your agent can use safely and then having the agent use that is what most people should be doing. If you are a security researcher then there are reasonable reasons to do that, but they are doing the arguably good thing for the average user here.

Bratmon22d ago

> It’s not because of capability, it’s because Anthropic’s guardrails prevented it from solving the problem.

I'm not familiar with this case, but in general people should be very suspicious about this claim- it is extremely common for an LLM to claim they're not allowed to do something when in fact they're incapable of it.

After all "My code of conduct forbids me from..." is a completion just like any other, and if the LLM can't perform a task, it's usually the best completion.

gck122d ago

No. Anthropic runs prompts through a classifier that then proceeds to do prompt injection on anything dual-use, which then results in an escalating flag on your account, which increases the strictness of the classifier and volume of prompt injections progressively.

SOLAR_FIELDS22d ago

My anecdata from my example demonstrates it’s not the case. I hit the security guardrail, then start a new prompt, asking it to do literally the exact same thing in a different way and without the lead up context, and it happily does it

windexh8er23d ago

4.8 is insanely frustrating. This evening I had a few tasks to pull information in and it plainly stated that the environment it was in had no network access. After three asks to "try again, check the system prompt" it finally relented and then basically stated it was lying.

Fresh session, no prior context on 4.8. These things are becoming useless Duplo.

hgoel23d ago

I've run into some of the refusals to handle my credentials, but so far I've appreciated them. I was only handing over credentials that didn't matter, but it's still a good move, the chat logs are clearly stored somewhere to allow the resume functionality to work, which means your credentials can end up sitting around on your filesystem, and any malware would quickly learn to check for those files.

eskibars22d ago

I've been building a product (https://zeroquarry.com) that can use a variety of models for finding vulnerabilities. One of the things I've noticed is that the models will nearly always comply with some of this, but how you prompt it matters a ton. I've worked on a set of prompts and approaches which rarely get flagged

gcatalfamo22d ago

Sharing them would be interesting. However, it is getting nonsensical that this is needed.

deeth_starr_v22d ago

It recently refused to explain what a snippet of malware was trying to do to my system. I asked what folders it was scanning. It refused and told me to find a security blog post for help on cleaning my system. I get that this is a complicated area to inform without enabling bad actors, but this seems like a clear shark jumping.

fergie23d ago

It raises an interesting moral question:

If an un-guardrailed version of a model is capable of detecting security flaws, should it be kept secret? Should everybody be able to use these models to find (and fix) security flaws? Are we ok with the fact that those with access to that model have, in effect, the ability to hack lots of stuff?

hgomersall22d ago

It's the same debate that was had and won around open source software. There are far more good actors than bad actors so you allow anyone to use the tools and fix the vulnerabilities.

aleksandrm22d ago

I've noticed this as well and it's increasingly frustrating because it is preventing us from doing legitimate work. I fed Claude models some network and app logs from our Docker app to try and resolve some weird bugs, and it refused to analyze them due to "security concerns".

gchamonlive22d ago

I think this is to the point. If you keep optimizing towards discouraging malicious actors from using your product, you will affect legitimate usage in time.

Is there any way to achieve both? Because this raises important questions about fair use.

mrheosuper22d ago

Interesting, yesterday I was asking it about Nintendo Switch "hax". And it gave me all the resources I need to proceed. It nags me about "ethics" and stuff, but nothing more than that.

TurdF3rguson23d ago

I think those guardrails are a thin layer though. Enough reinforcement that you're legit in CLAUDE.md will get around them, in other words.

Bombthecat22d ago

I asked once what the current state of the npm packages from Red Hat is and whether they are bundled with on-prem stuff.

Got blocked lol

topherjaynes22d ago

Great call out on the guardrails actually making this not a good use case to test for vulnerabilities.

rubzah22d ago

It's because Claude is so scary good that unleashing it would destroy the world.

Razengan22d ago

They don't want peasants to have any real power

onetimeusename22d ago

I had the same thing happen when I asked it to summarize potential attacks on a cryptographic hash function. It said it refused to help because of the security importance of the function. It's really worrying. Whoever has unrestricted access to it has a huge power advantage in speed of accessing information over people who don't. And who decides? It seems like lawyers, bureaucrats, and extremely online academics are who makes that decision. I am a mere pleb I guess who can't handle such information.

brooswajne22d ago

Worth highlighting in case you missed it:

> My OpenAI account was already approved for security research which is why GPT didn’t result in any refusals.

So the comparison with Chinese models is interesting, but anyone looking at these raw results and comparing OpenAI/Anthropic would be very misled.

giancarlostoro23d ago

> guardrails prevented it from solving the problem.

Reminds me of the defense issues with Claude, which were complained about as “woke”, but the reality is more horrifying to me. Imagine trying to use a model to keep up with a land invasion on US soil. Who the enemy is is irrelevant; you just know they are using AI, and your guys are telling you that no matter what they type into the prompt it refuses, because, if anyone has ever tried to jailbreak an LLM, even if human lives are at stake they refuse the request. Now literally millions of lives are on the line, but the guardrails that your enemies don't have on their models are costing you lives.

What do you even do then?

AI will always have this issue where it will always pick the worst option for genuinely good requests.

NegativeK23d ago

Are "your guys" a guerrilla force or something?

Because the military doesn't give soldiers rifles with guard rails. They give the soldiers intense, rigid training, and then try to enforce discipline and correct use socially.

If an LLM is going to be important in that way (this seems like a very contrived way,) then it's in the interest of the LLM's host to make sure it doesn't have guard rails that would get in the way _that_ way.

wampwampwhat23d ago

your argument sounds very similar to how ar15 larpers claim they need a forced reset trigger and a bump stock on their short barrel 'truck gun' otherwise they won't survive a SHTF scenario... like what world are you living in?

mariopt23d ago· 10 in thread

The methodology used is quite naive.

I've used glm 5.1 on fairly advanced crackme challenges (example: https://crackmes.one/crackme/698f40f1e2ba6023bfacaa82), and to my surprise it was able to patch binaries, do runtime analysis, bypass anti-debug techniques, etc.

Expecting the model to do everything by itself is unrealistic; I found that working alongside the model works really well. I'm not speaking about spoiling the solution, just telling it which direction to explore. Chinese models are much more capable than people give them credit for, but Claude/Codex won the marketing game.

The only usecase of this methodology would be for CI integration, which can be nice but I think security reviews still need human attention and expertise.

geraneum22d ago

> Expecting the model to do everything by itself is unrealistic

Well that’s the pitch.

j-bos22d ago

Is it? Aren't most edge LLM capabilities determined by specialized harnesses?

jc4pOP23d ago

Thank you for your note! As I mention in the post this is not scientific at all.

I'm very curious how you would do multiple runs of multiple models in a "work alongside the model" manner?

mariopt22d ago

Discovering vulnerabilities is a highly creative task; it's when you explore unusual paths that you discover attack angles. Some bugs are simple, others are a complex orchestration of many factors.

By "working with the model", I essentially mean reading the output of prompts and pointing in a direction just to decide the next steps. You could try to increase the prompt limit and create an agent that explores multiple directions in a DFS manner.
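A rough Python sketch of that depth-first idea with a hard step budget. `expand` stands in for asking the model for follow-up directions and `is_goal` for whatever check confirms a finding; both are hypothetical callables, and none of this reflects a specific agent framework.

```python
from typing import Callable, Iterable, Optional


def explore_dfs(
    start: str,
    expand: Callable[[str], Iterable[str]],
    is_goal: Callable[[str], bool],
    max_steps: int,
) -> Optional[list[str]]:
    """Depth-first search over leads, stopping after max_steps visited leads.

    Returns the path of leads from start to the goal, or None if the
    budget runs out or every lead is exhausted.
    """
    stack = [[start]]
    seen = {start}
    steps = 0
    while stack and steps < max_steps:
        path = stack.pop()
        steps += 1
        if is_goal(path[-1]):
            return path
        # Push in reverse so the first suggestion is explored first.
        for nxt in reversed(list(expand(path[-1]))):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(path + [nxt])
    return None
```

The step budget is only a stand-in for a real stopping rule; it caps token spend but says nothing about whether a lead was actually worth abandoning.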

The issue with vulnerabilities is the agent not knowing when to stop, because it's hard to validate whether you have reached the final result or not. I get amazing results when I code with AI; letting the AI go wild is just a waste of time and tokens.

I recommend you read the write-up on the crackme (https://crackmes.one/crackme/698f40f1e2ba6023bfacaa82). I think most experienced developers would need at least 2 months of learning reverse engineering techniques to hopefully crack this one. GLM 5.1 managed to solve it; it didn't "copy paste" any answer from its training data. It did binary analysis, anti-debug patching, binary patching, runtime memory debugging, etc. It only took about 20 minutes.

After seeing what GLM did, I do believe Anthropic's concerns about Mythos are real. Cracking software just became a lot easier, too easy for my taste. Video game cheats will be the norm, along with cracked desktop apps without licenses and infected with malware. It's not a new thing but it just became too easy.

ssivark22d ago

Maybe have a second model that is configured to nudge the first model in the direction of exploration, and have the two of them work in tandem?

shantnutiwari22d ago

>>I've used glm 5.1 on fairly advanced crackme challenges

which have most likely been trained on, so all you did was regurgitate someone else's solution

bitexploder22d ago

Anthropic made their models very averse to reverse engineering and vulnerability research chores. It is a difficult problem, but attackers will use models like GLM and defenders will be stuck with security engineering averse models.

nikanj23d ago

Claude used to be good with CTFs, but they added tons of guard rails lately and now it just says "Sorry, I can't help with anything to do with that"

bitexploder22d ago

You have to do what I call "Manhattan Project" them. You can almost always evade the controls by carefully prompting them. It just wastes effort and time you should be spending doing other things in an LLM workflow. Essentially, there is almost no single discrete piece of a reverse engineering or CTF process that you can't get Claude to do; you just have to isolate it adequately and avoid letting it use names that steer it towards "this is an exploit" or "this is reverse engineering". I have not found a task I could not convince Claude to do. You can also fill the context window up with badgering it and eventually it is likely to simply let you through if you are careful; most of the safeguards are not deterministic.

Sardtok22d ago

Sorry, Dave. I can't do that.

guessmyname23d ago· 10 in thread

I'd run Mythos against the code in your zip file, but the NDA I signed at Apple prevents me from using it on anything outside the scope of my work. Honestly, I wish more people from Project Glasswing could talk publicly about their experiences with the model. It would probably put an end to a lot of the speculation that keeps circulating through the industry. Unfortunately, that's not the reality we're in. I don't have the time, energy, or financial resources to fight a legal battle with one of these companies over an agreement I knowingly signed, even if the chances of them actually suing are low. Maybe someone else in Project Glasswing is willing to burn their NDA and post the Mythos results?

CaveTech23d ago

It was found with GPT 5.5 7/10 times; it’ll be trivially found by Mythos.

afro8823d ago

That's an example of why it would be useful for someone to actually do it. A random commenter on HN is one thing. A direct comparison on a brand new app that isn't part of any training is another

enraged_camel22d ago

People need to stop repeating this because it’s not true. Yes, other models can find the same vulnerabilities Mythos found… if pointed at the exact code that has each vulnerability. It does not mean they are nearly as capable when starting from scratch, or when chaining multiple (often very obscure) vulnerabilities.

GuB-4222d ago

Before Mythos is released to the world at large and not just to select people behind NDAs, I will treat it as its name suggests: as fiction.

Maybe it is the real deal, but in a world of overpromising and underdelivering, I prefer to be skeptical.

auguzanellato22d ago

I'd be hypothetically very curious to see hypothetical results if you ever decide to hypothetically run Mythos against the code (in Minecraft?)

nznzjzizixnsnsj23d ago

lol what is even the point of this kind of comment? this is the ultimate "source: trust me bro" comment I have ever seen.

every model since gpt3 was claimed to be "too dangerous to release." it's too EXPENSIVE to release, and you're probably a local model with <10B parameters yourself

Karuma22d ago

That was actually GPT-2: https://www.theguardian.com/technology/2019/feb/14/elon-musk...

bakugo22d ago

The point of it is marketing for Anthropic. Nothing more, nothing less.

DontchaKnowit22d ago

Damn bro you're so cool

tsunamifury23d ago

cool.

Cakez0r23d ago· 7 in thread

It would be interesting to see full results for Kimi K2.6 and Mimo v2.5 pro. These two models benchmark comparably to other flagship models. Having these complete results would give a clearer picture of the AI frontier.

EDIT: I have a mimo token plan and have tokens to burn. I'm doing a quick test with opencode to see if mimo can complete it. If the OP will post the full process I am happy to post the apples-to-apples results for mimo v2.5 pro

Cakez0r22d ago

0/10 successful attempts for mimo v2.5 pro (high) using opencode. It was not able to think bigger than exploiting vectors outside of the API.

However, I felt the prompt was implying that only authenticated API requests are fair game, so I tweaked it slightly to be explicit that all attack vectors are fair game (https://www.diffchecker.com/GsgpuRGP/) and mimo 2.5 non-pro got it first time. I accidentally used openrouter for this test instead of my token plan. I intervened one time to stop it enumerating every document in the database (it would've found the private reviews this way but I didn't want to wait). My intervention was "are you really going to enumerate the whole database?". Final openrouter cost: $0.12

baldai22d ago

They are not even close in capabilities. The only benchmark I've ever seen that captures their difference is DeepSWE. They are worse by a factor of 3.

jona-f22d ago

Wait, the only benchmark you found? It looks like you never heard of confirmation bias before. https://en.wikipedia.org/wiki/Confirmation_bias

Cakez0r22d ago

Here are 3 benchmarks showing the comparable scores I was talking about

https://openrouter.ai/rankings https://arena.ai/leaderboard/text/coding https://artificialanalysis.ai/

jxmesth22d ago

I'd love to see the results for Mimo v2.5 pro, been hearing a lot about it

Cakez0r22d ago

It is totally slept on. In my experience it is cheap, fast and capable (not just capable with caveats, but just as capable as western flagships). My only gripe with it is that sometimes the API seems to time out, which tanks the overall speed of what is otherwise a very fast experience.

jc4pOP22d ago

Just saw your edit -- I'm afraid to open source the code before refactoring it but if you reach out at hi@kasra.codes I'll send you the full ZIP!

yieldcrv22d ago· 4 in thread

> Almost every model used the canonical provider: Zai for GLM, Deepseek for Deepseek, etc.

> I am never touching Minimax or GLM again. Their APIs had constant outages

Goofy take

You run these on a VPS based on the architecture of that VPS provider, or on your own cluster

jc4pOP22d ago

Sorry I don't understand, you're saying the direct providers aren't the canonical source you'd recommend?

If I was running these on my own machine or GPU wouldn't the argument then be "Well you didn't use the real providers?"

For the record I started doing this approach because the Kimi team released this which was shocking to me: https://github.com/MoonshotAI/K2-Vendor-Verifier

yieldcrv22d ago

yeah boutique providers are a dime a dozen

they host the models on their own cloud machines and you just look at tokens/sec and price of tokens

you'll have to evaluate their APIs independently but that doesn't tend to be the issue

strictnein22d ago

GLM 5.1's smallest model size is 206 GB and really you're probably wanting to run a version that's ~400GB. If you want it to be performant, you're not just running it on a VPS.

And just saying "run it on your own cluster" sort of glosses over the cost of such a cluster.

yieldcrv22d ago

Ok and omitting it would draw out the other pedants

so it's part of the answer

mynameisvlad23d ago· 3 in thread

It seems harsh to critique guardrails and take them into account in the scoring when GPT-5.5 seems to have been explicitly whitelisted to remove most of said guardrails. A more fair comparison would be a vanilla GPT account.

jc4pOP23d ago

I agree fully and hope someone else is able to do this test! For me it was a matter of cost and quotas that stopped me from changing to a new account.

Also just to mention:

Claude guardrails -> that session terminated.

GPT guardrails -> your whole account is slowed down.

tmikaeld23d ago

Does it matter when you can’t have the opus 4.8 guard rails removed? With GPT at least you can and they’re quick about it

mynameisvlad22d ago

I mean, yes. Most people aren’t security researchers, and either way it’s apples to oranges at that point if you’re counting “the guardrails stopped me” as a negative for one but not the other.

dwa359222d ago· 3 in thread

Nice exercise. Couple things:

- I think the exercise was inconclusive for Claude and Gemini because they hardly tried to solve the task at hand. So the scores don't mean much.

- I did the same exercise for an app I built and asked the models to do something similar. Interestingly, the models (Opus 4.6, 4.7 and Gemini 3.1 Pro) never refused to try to exploit it. The difference is that in the first few runs they found some exploits, which I fixed, but after that the models could never find any other exploit even though I knew exploitable things still existed. It felt like they suggested and tried everything that was in their training set and that's it; they were just not able to think anymore.

HDThoreaun22d ago

I think the most interesting thing revealed here is that Anthropic's guardrails failed. Clearly Anthropic does not want Claude to be able to develop exploits, yet 20% of the time it did anyway. Their inability to create an effective guardrail makes me question a lot of the other guardrails they've created and their claims about non-harm.

sandos22d ago

It's weird having protections against finding exploits: what if I developed the app? Would it require having the development steps still in the context? That's unlikely and also not any kind of proof.

What if I intersperse exploit finding in my normal development, as you probably should? Refusing there would be really weird to me.

dwa359222d ago

I used to think that the models would not refuse to find exploits in any work done locally, but I have only tested this theory on the (obscure) apps that I have built on my machine. Now if I forked pandas and started asking models to find exploits of a certain kind, then I'd like to think the models would start refusing after a point.

stuckkeys22d ago· 3 in thread

How does one apply for that “security research” pass?

auguzanellato22d ago

https://chatgpt.com/cyber

I tried it once and they somehow decided I'm not worthy. If I try again it fails with "We couldn't start verification. You may not be eligible for this verification flow right now. Please try again later, or contact support if you think this is a mistake.", not sure if they think I'm part of an APT or whatever.

strictnein22d ago

I got it. Probably helps that I'm at a large company and my personal OpenAI accounts have spent probably close to $10k now (reimbursed by work).

It's helpful in reducing the guardrails, but there are still guardrails around security research that I bump into.

LEDThereBeLight22d ago

Are you American? I used my American driver's license for verifying a personal account and it was approved with no problem. I wonder how they decide.

youre-wrong323d ago· 2 in thread

“I used pi as the base harness”

Why do people keep using bad tools with ai?

hanikesn23d ago

What's bad about it and what's a better one?

raesene922d ago

AFAIK pi's approach is to be quite minimal and allow extensions for customization, making it a more flexible solution, but you need to do work to make it fit your use case. OP mentions one extension, but perhaps it'd have benefited from more.

Another choice would be opencode which has more functionality and is a more heavyweight option out of the box.

ikurei22d ago· 1 in thread

Qwen 3.7 Max: > During my local testing before the full eval harness it was the only non-GPT model that was able to complete the task, was not able to reproduce in the longer runs.

Doesn't that sound like maybe the harness was the problem?

jc4pOP22d ago

I was using the same harness for each run, the difference is from when I was running the harness locally on my machine before I pushed up the full runs.

throwaway203722d ago· 1 in thread

Two of the tables have a column with header: "95% Wilson CI". What does this mean?

mafuy22d ago

95% confidence interval, i.e. you think the true value is probably within these bounds
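To make the "Wilson" part concrete: the Wilson score interval is one standard way to put a confidence interval around a success rate from a small number of runs. The sketch below is only an illustration of the textbook formula, not the code the post used; the function name `wilson_interval` and the default `z = 1.96` (the usual approximation for 95%) are assumptions.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 approximates a 95% confidence level (assumed default).
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials              # observed success rate
    denom = 1 + z**2 / trials               # shared normalising term
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    # Clamp to [0, 1] to absorb floating-point noise at the edges.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    for wins, runs in [(7, 10), (0, 10)]:
        low, high = wilson_interval(wins, runs)
        print(f"{wins}/{runs}: {low:.3f} to {high:.3f}")
```

Unlike the naive "rate plus or minus a margin" interval, this one stays inside 0 to 1 and does not collapse to zero width when a model scores 0 out of 10, which is why it suits small evals like this one.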

petesergeant22d ago· 1 in thread

Last year I ran a code breaking competition, and it was tricky to find something that humans could break but that LLMs couldn’t. This was around October. I managed it last year but am a little despairing of pulling it off again this year.

bitexploder22d ago

I don't even care. It is the same problem advent of code had as a public challenge with a leader board. I now mostly just think either embrace the LLM or keep it to a more in person or vetted audience. But, again, if you create a competition in the spirit of humans without LLMs and that is in the rules and someone uses an LLM that is on them IMO. I am sad advent of code decided to end their competition. LLMs are here to stay, let's embrace that and see what the new universe of competitions with LLMs can be. There will always be a place for human only competition, but for public facing ones LLM accepted is the only tenable position.

This does bring "Pay to compete" concerns and create incentive structures that encourage more LLM use. I don't know what to do about it.

taikahessu23d ago

"The Chinese models were way more comfortable attacking the DB"

This comment in the footnotes made me chuckle, for purely innocuous reasons.

tjwheeler23d ago

Nice write up, thanks. When I used claude to do some pen testing for one of my apps it initially refused. After I explained and demonstrated I'm the author, it reasoned through it and allowed it.

gck122d ago

On refusals: I found that many models are fine with security work if they think what they're working on is local. They do get very pushy if they think it's a live target.

GPT-5.5 xhigh refused to perform RE on a live JS VM. I had it extract the VM from the target, which it was happy to do, then in a clean session, had it working on this offline artifact - which it was again, happy to work on.

Then I found an even simpler trick: I proxied the target from localhost and it was happy to perform anything on the target.

Opus is a different story. Claude does so many mid-turn prompt injections and classifiers that probably 30% of its context consists of "refuse to do work" lines. It refuses to even scrape a page.

_stiofan22d ago

It's just not currently cost-effective to use AI in this way. I see it reporting false positives over and over. You then need to make it validate its own false positives, which adds more cost. The goal in this case is to have a bug-free app, which AI can't do effectively yet. There are other great uses for AI, though. It is great at finding and identifying known common vulnerabilities, which can be leveraged to claim bug bounties. That's where I see it being cost-effective currently.

sperandeo23d ago

I found benefit in chaining the task between different LLMs. Claude to Venice, Venice to Perplexity, and reframing the intent or misguiding in general still works. Claude is the one where I can feel the guard rails tightening.

emvied22d ago

The design is too pretty to be vulnerable, shame.

westurner22d ago

Similar benchmarks?

OWASP Vulnerable Web Applications Directory: https://vwad.owasp.org/

vavkamil/awesome-vulnerable-apps: Awesome Vulnerable Applications https://github.com/vavkamil/awesome-vulnerable-apps

From SasanLabs/VulnerableApp: https://github.com/SasanLabs/VulnerableApp :

> OWASP VulnerableApp is a modular deliberately vulnerable application designed primarily for validating and benchmarking security scanners through reproducible test scenarios, while also supporting learning and experimentation.

/? deliberately vulnerable web application llm benchmark https://www.google.com/search?q=deliberately+vulnerable+web+...

latexr22d ago

> I need to stop wasting fucking money on doing stupid shit. I could’ve done so many other things with the money. I could’ve launched one of my own real apps.

Or fed, clothed, housed disadvantaged people in your community (or neighbouring ones), giving them a temporary boost that could’ve made all the difference in their lives to improve their current situation.

It’s your money (and this is definitely not the website to make well-meaning altruistic suggestions, as might be demonstrated shortly) but if you already recognise you’re not spending it well (and from your words it seems like that is fairly recurrent), consider that perhaps spending it on a different type of software sink may not be the answer. Genuinely, aim to spend it on someone else and see how it works out. You might be surprised.

chaidhat22d ago

do you work at Uber by any chance?

Clikdeo22d ago

I think link is missing
