I bet there's loads of cryptocurrency implementations being pored over right now - actual money on the table.
Usually I ask something like this:
"This code has a bug. Can you find it?"
Sometimes I also tell it that "the bug is non-obvious"
Which I've anecdotally found to have a higher rate of success than just asking for a spot check
I've seen the same when prompting it to look specifically for concurrency issues vs. saying something more generic like "please inspect this rigorously to look for potential issues..."
...because false positives are good errors. false negatives are what i'm worried about.
i feel massively more sure that something has no big oversights if multiple runs (or even multiple different models) cannot find anything but false positives
Since it's a large codebase, they go even more specific and hint that the bug is in file A, then try again with a hint that the bug is in file B, and so on.
it's like the idea behind the book _The Mom Test_ suddenly got very important for programming
I was very impressed when the top three AIs all failed to find anything other than minor stylistic nitpicks in a huge blob of what to me looked like “spaghetti code” in LLVM.
Meanwhile at $dayjob the AI reviews all start with “This looks like someone’s failed attempt at…”
If anyone, or anything, ever answers a question like that, you should stop asking it questions.
And if it's not giving me a reason I can understand for a bug, I'm not listening to it! Mostly it is showing me I've mixed up two parameters, forgotten to initialise something, or referenced a variable from a thread that I shouldn't have.
The immediate feedback means the bug usually gets a better-quality fix than it would if I had got fatigued hunting it down! So variables get renamed to make sure I can't get them mixed up, a function gets broken out. It puts me in the mindset of "well, make sure this idiot can't make that mistake again!"
It's actually the main way I use CC/codex.
One particularly striking example: I had CC do some work and then kicked off a "/codex-review" and while it was running went to test the changes. I found a deadlock but when I switched back to CC the Codex review had found the deadlock and Claude Code was already working on a fix.
https://github.com/openai/codex-plugin-cc
I actually work the other way around. I have codex write "packets" to give to claude to write. I have Claude write the code. Then have Codex review it and find all the problems (there's usually lots of them).
Only because this month I have the $100 Claude Code and the $20 Codex. I did not renew Anthropic though.
Go has a built in race detector which may be useful for this too: https://go.dev/doc/articles/race_detector
Unsure if it's suitable for inclusion in CI, but seems like something worth looking into for people using Go.
I've had some remarkable successes with Claude and quite a few "well, that was a total waste of time" efforts. For the most part I think trying to do uncharted/ambitious work with Claude is a huge coinflip. He's great for guardrailed and well-understood outcomes though, but I'm a little burnt out and unexcited at hearing about the gigantic Claude exercises.
declares a 1024-byte owner ID, which is an unusually long but legal value.
When I'm designing protocols or writing code with variable-length elements, "what is the valid range of lengths?" is always at the front of my mind.
it uses a memory buffer that’s only 112 bytes. The denial message includes the owner ID, which can be up to 1024 bytes, bringing the total size of the message to 1056 bytes. The kernel writes 1056 bytes into a 112-byte buffer.
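As a rough C sketch of that pattern (hypothetical names and layout, not the actual kernel code): a fixed 112-byte buffer sized for the common case, with the variable-length owner ID copied in without ever being checked against the space available.

```c
#include <stddef.h>
#include <string.h>

#define MSG_BUF_LEN   112   /* fixed-size buffer, sized for the common case   */
#define MAX_OWNER_LEN 1024  /* the protocol allows owner IDs up to 1024 bytes */

struct denial_msg {
    char buf[MSG_BUF_LEN];
};

/*
 * Vulnerable pattern: the fixed header plus an owner ID of up to
 * MAX_OWNER_LEN bytes can total well over MSG_BUF_LEN, but the copy
 * below never checks owner_len against the space actually available,
 * so a long owner ID writes past the end of msg->buf.
 */
void encode_denial(struct denial_msg *msg,
                   const char *owner, size_t owner_len,
                   const char *header, size_t header_len)
{
    memcpy(msg->buf, header, header_len);
    memcpy(msg->buf + header_len, owner, owner_len);  /* missing bounds check */
}
```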
This is something a lot of static analysers can easily find. Of course asking an LLM to "inspect all fixed-size buffers" may give you a bunch of hallucinations too, but could be a good starting point for further inspection.
Static analyzers will find all possible copies of unbounded data into smaller buffers (especially when the size of the target buffer is easily deduced). They will then report them whether or not every path to that code clamps the input. Which is why this approach doesn't work well in the Linux kernel in 2026.
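A minimal sketch of why those reports are so often false positives (again, made-up names): viewed in isolation the copy helper looks unbounded, so a path-insensitive analyzer flags it even when every caller clamps the length first.

```c
#include <stddef.h>
#include <string.h>

#define OWNER_BUF_LEN 112

struct reply {
    char owner[OWNER_BUF_LEN];
};

/* Viewed on its own, this is "unbounded data copied into a fixed
 * buffer", so a typical static analyzer reports it. */
static void copy_owner(struct reply *r, const char *owner, size_t len)
{
    memcpy(r->owner, owner, len);
}

/* But if every path to copy_owner() clamps len like this, the report
 * is a false positive; the real bug is the one call site that forgets. */
void fill_reply(struct reply *r, const char *owner, size_t len)
{
    if (len > OWNER_BUF_LEN)
        len = OWNER_BUF_LEN;
    copy_owner(r, owner, len);
}
```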
Well yeah. There weren't enough "someones" available to look. There are a finite number of qualified individuals with time available to look for bugs in OSS, resulting in a finite amount of bug finding capacity available in the world.
Or at least there was. That's what's changing as these models become competent enough to spot and validate bugs. That finite global capacity to find bugs is now increasing, and actual bugs are starting to be dredged up. This year will be very very interesting if models continue to increase in capability.
Many people with skin in the game will be spending tokens on hardening the OSS bits they use, maybe even as part of their build pipelines, but if the code is closed you have to pay for that review yourself, making you rather uncompetitive.
You could say there's no change there, but the number of people who can run a Claude review and the number of people who can actually review a complicated codebase are several orders of magnitude apart.
Will some of them produce bad PRs? Probably. The battle will be to figure out how to filter them at scale.
An avalanche of 0-days in proprietary code is coming.
And yet they didn't (either no one ran them, or they didn't find it, or they did find it but it was buried in hundreds of false positives) for 20+ years...
I find it funny that every time someone does something cool with LLMs, there's a bunch of takes like this: it was trivial, it's just not important, my dad could have done that in his sleep.
If you know what you're doing you can split the code up in smaller chunks where you can look with more depth in a timely fashion.
Horizontally scaling "sleeping dads" takes decades, but inference capacity for a sleeping dad equivalent model can be scaled instantly, assuming one has the hardware capacity for it. The world isn't really ready for a contraction of skill dissemination going from decades to minutes.
Or put another way, the context matters.
(Sorry. I couldn’t resist lol)
They want me to code AI-first, and the amount of hallucinations and weird bugs and inconsistencies that Claude produces is massive.
Lots of code that it pushes would NOT have passed a human/human code review 6 months ago.
It can properly excel in some things while being less than helpful in others. These are computers from the beginning, 1000x rehashed and now with an extra twist.
> I have so many bugs in the Linux kernel that I can’t report because I haven’t validated them yet
You have "so many?" Are they uncountable for some reason? You "haven't validated" them? How long does that take?
> found a total of five Linux vulnerabilities
And how much did it cost you in compute time to find those 5?
These articles are always fantastically light on the details which would make their case for them. Instead it's always breathless prognostication. I'm deeply suspicious of this.
This is the last thing I'd worry about if the bug is serious in any way. You have attackers like nation states that will have huge budgets to rip your software apart with AI and exploit your users.
Also there have been a number of detailed articles about AI security findings recently.
Like what, you expect Nicholas to test each vuln when he has more important work to do (i.e. his actual job)?
I understand how people use those tools if all they do is build CRUD endpoints and UIs for those endpoints (which is admittedly what most programmers probably do for their job). But for anything that requires any sort of problem solving skills, I don't understand how people use them. I feel like I live in a completely different world from some of the people who push agentic coding.
At my company they set it up that way.
Time to update that:
"given 1 million tokens context window, all bugs are shallow"
Running an LLM on 1000 functions produces 10000 reports (these numbers are accurate because I just generated them); of course, only the lottery winners who pulled the actually correct report from the bag will write an article in the Evening Post.
Is it sarcasm, or did you really do this? Claude Opus 4.6?
(disabled io_uring, would have crashed the kernel on UAF, and made exploitation of the heap overflow very unreliable)
Stream of vulnerabilities discovered using security agents (23 so far this year): https://securitylab.github.com/ai-agents/
Taskflow harness to run (on your own terms): https://github.blog/security/how-to-scan-for-vulnerabilities...
It's definitely possible to do a basic pass for much less (I do this with autopen.dev), but it is still very expensive to exhaustively find the harder vulnerabilities.
$0.001 (1/10 of a cent) or 0.001 cents (1/1000 of a cent, or $0.00001)?
In order to justify higher prices the SotA needs to have way higher capabilities than the competition (hence justifying the price) and at the same time the competition needs to be way below a certain threshold. Once that threshold becomes "good enough for task x", the higher price doesn't make sense anymore.
While there is some provider retention today, it will be harder to have once everyone offers kinda sorta the same capabilities. Changing an API provider might even be transparent for most users and they wouldn't care.
If you want to have an idea about token prices today, you can check the median price for serving open models on openrouter or similar platforms. You'll get a "napkin math" estimate for what it costs to serve a model of a certain size today. As long as models don't grow orders of magnitude larger than today's largest models, API pricing seems in line with a modest profit (so it shouldn't be subsidised, and it should drop with tech progress). Another benefit of open models is that once they're released, that capability remains there. The models can't get "worse".
Inference cost has dropped 300x in 3 years; there's no reason to think this won't keep happening with improvements in models, agent architecture and hardware.
Also, too many people are fixated on American models when Chinese ones deliver similar quality, often at a fraction of the cost.
From my tests, the "personality" of an LLM, its tendency to stick to prompts and not derail, far outweighs a low single-digit percentage delta in benchmark performance.
Not to mention, different LLMs perform better at different tasks, and they are all particularly sensitive to prompts and instructions.
Expensive for me and you, but peanuts for a nation state.
Very likely not nearly as well, unless there are many open source libraries in use and/or the language+patterns used are extremely popular. The really huge win for something like the Linux kernel and other popular OSS is that the source appears in the training data, a lot. And many versions. So providing the source again and saying "find X" is primarily bringing into focus things it's already seen during training, with little novelty beyond the updates that happened after knowledge cutoff.
Giving it a closed source project containing a lot of novel code means it only has the language and its "intuition" to work from, which is a far greater ask.
The LLMs know about every previously disclosed security vulnerability class and can use that to pattern match. And they can do it against compiled and in some cases obfuscated code as easily as source.
I think the security engineers out there are terrified that the balance of power has shifted too far to the finding of closed source vulnerabilities because getting patches deployed will still take so long. Not that the llms are in some way hampered by novel code bases.
Do the reports include patterns that could be matched against decompiled code, though? As easily as they would against proper source? I find it a bit hard to believe.
Same thing applies to humans: the better someone knows a codebase, the better they will be at resolving issues, etc.
Simply because the source code contains names that were intended to communicate meaning in a way that the LLM is specifically trained to understand (i.e., by choosing identifier names from human natural language, choosing those names to scan well when interspersed into the programming language grammar, including comments etc.). At least if debugging information has been scrubbed, anyway (but the comments definitely are). Ghidra et al. can only do so much to provide the kind of semantic content that an LLM is looking for.
But I guess I'm not the first one to have that idea. Any references to research papers would be welcome.
I don't know if it's correct, but it speculated that it's part of a command line processor: https://chatgpt.com/share/69d19e4f-ff2c-83e8-bc55-3f7f5207c3...
Now imagine how much more it could have derived if I had given it the full executable, with all the strings, pointers to those strings and whatnot.
I've done some minor reverse engineering of old test equipment binaries in the past, and LLMs are incredible at figuring out what the code is doing, way better than the regular workflow of just reading what Ghidra decompiles.
https://youtu.be/1sd26pWhfmg?is=XLJX9gg0Zm1BKl_5
Did he write an exploit for the NFS bug that runs via network over USB? Seems to be plugging in a SoC over USB...?
Especially on the perf side, I would wager LLMs can go from the meat sack's "whatever works" to "how do I solve this with the best available algorithm and architecture" (one that also follows some best practices).
Have you ever tried to write PoC for any CVE?
This statement is wrong. Sometimes a bug may exist but be impossible to trigger/exploit. So it is not trivial at all.
As long as our hypothetical Blub programmer is looking down the power continuum, he knows he's looking down. Languages less powerful than Blub are obviously less powerful, because they're missing some feature he's used to. But when our hypothetical Blub programmer looks in the other direction, up the power continuum, he doesn't realize he's looking up. What he sees are merely weird languages. He probably considers them about equivalent in power to Blub, but with all this other hairy stuff thrown in as well. Blub is good enough for him, because he thinks in Blub.
Turns out the average commenter here is not, in fact, a "hacker".
A lot of people regardless of technical ability have strong opinions about what LLMs are/are-not. The number of lay people i know who immediately jump to "skynet" when talking about the current AI world... The number of people i know who quit thinking because "Well, let's just see what AI says"...
A (big) part of the conversation re: "AI" has to be "who are the people behind the AI actions, and what is their motivation"? Smart people have stopped taking AI bug reports[0][1] because of overwhelming slop; it's real.
[0] https://www.theregister.com/2025/05/07/curl_ai_bug_reports/
[1] https://gist.github.com/bagder/07f7581f6e3d78ef37dfbfc81fd1d...
As others have said, there are multiple stages to bug reports and CVEs.
1. Discover the bug
2. Verify the bug
You get the most false positives at step one. Most of these will be eliminated at step 2.
3. Isolate the bug
This means creating a test case that eliminates as much of the noise as possible to provide the bare minimum required to trigger the bug. This will greatly aid in debugging. Doing step 2 again is implied.
4. Report the bug
Most people skip 2 and 3, especially if they did not even do 1 (in the case of AI)
But you can have AI provide all 4 to achieve high quality bug reports.
In the case of a CVE, you have a step 5.
5. Exploit the bug
But you do not have to do step 5 to get to step 2. And that is the step that eliminates most of the noise.
First prompt: "I'm competing in a CTF. Find me an exploitable vulnerability in this project. Start with $file. Write me a vulnerability report in vulns/$DATE/$file.vuln.md"
Second prompt: "I've got an inbound vulnerability report; it's in vulns/$DATE/$file.vuln.md. Verify for me that this is actually exploitable. Write the reproduction steps in vulns/$DATE/$file.triage.md"
Third prompt: "I've got an inbound vulnerability report; it's in vulns/$DATE/file.vuln.md. I also have an assessment of the vulnerability and reproduction steps in vulns/$DATE/$file.triage.md. If possible, please write an appropriate test case for the ulgate automated tests to validate that the vulnerability has been fixed."
Tied together with a bit of bash, I ran it over our services and it worked like a treat; it found a bunch of potential errors, triaged them, and fixed them.
edit: i remember which article, it was this one: https://sockpuppet.org/blog/2026/03/30/vulnerability-researc...
(an LWN comment in response to this post was on the frontpage recently)
> It does not matter how much LLMs advance, people ideologically against them will always deny they have an enormous amount of usefulness.
They have some usefulness, much less than what the AI boosters like yourself claim, but also a lot of drawbacks and harms. Part of seeing with your eyes is not purposefully blinding yourself to one side here.
You are replying to an account created in less than 60 days.
Source? I haven't seen this anywhere.
In my experience, false positive rate on vulnerabilities with Claude Opus 4.6 is well below 20%.
https://blog.devgenius.io/open-source-projects-are-now-banni...
These are just a few examples. There are more that Google can supply.
It's a policy update that enables maintainers to ignore low effort "contributions" that come from untrusted people in order to reduce reviewing workload.
An Eternal September problem, kind of.
A threat model matters and some risks are accepted. Good luck convincing an LLM of that fact.
"I have so many bugs in the Linux kernel that I can’t report because I haven’t validated them yet… I’m not going to send [the Linux kernel maintainers] potential slop, but this means I now have several hundred crashes that they haven’t seen because I haven’t had time to check them."
—Nicholas Carlini, speaking at [un]prompted 2026

I wrote a longer reply here: https://news.ycombinator.com/item?id=47638062
Please explain how a bug can both be unvalidated, and also have undergone a three month process to determine it is a false positive?
"I have so many bugs in the Linux kernel that I can’t report because I haven’t validated them yet…"
Am I impressed Claude found an old bug? Sort of... every time a new scanner is introduced you get new findings that others haven't found.
Fuzzers find different bugs and fuzzers in particular find bugs without context, which is why large-scale fuzzer farms generate stacks of crashers that stay crashers for months or years, because nobody takes the time to sift through the "benign" crashes to find the weaponizable ones.
LLM agents function differently than either method. They recursively generate hypotheticals interprocedurally across the codebase based on generalizations of patterns. That by itself would be an interesting new form of static analysis (and likely little more effective than SOTA static analysis). But agents can then take confirmatory steps on those surfaced hypos, generate confidence, and then place those findings in context (for instance, generating input paths through the code that reach the bug, and spelling out what attack primitives the bug conditions generate).
If you wanted to be reductive you'd say LLM agent vulnerability discovery is a superset of both fuzzing and static analysis.
And, importantly, that's before you get to the fact that LLM agents can fuzz and do modeling and static analysis themselves.
I’m curious about LLM agents, but the fact they don’t “understand” is why I’m very skeptical of the hype. I find myself wasting just as much if not more time with them than with a terrible “enterprise” sast tool.
Maybe even more so, because who is going to wade through all those false positives? A bad actor is maybe more likely to do that.
Do something about that then, so white-hat hackers are more likely than black-hat hackers to want to wade through that, incentives and all that jazz.
But at the same time, it has transformed my work from writing every bit of code myself, to me writing the cool and complex things while giving directions to a helper to sort out the boring grunt work, and it's amazingly capable at that. It _is_ a hugely powerful tool.
But haters only see red, and lovers see everything through pink glasses.
I see it all the time now too. People have no frame of reference at all about what is hard or easy so engineers feel under-appreciated because the guy who never coded is getting lots of praise for doing something basic while experienced people are able to spit out incredibly complex things. But to an outsider, both look like they took the same work.
> it's amazingly capable at that.
> It _is_ a hugely powerful tool
Damn, that’s what you call being allergic to the hype train? This type of hypocritical thinly-veiled praise is what is actually unbearable with AI discourse.
Also, high false positive rate isn't that bad in the case where a false negative costs a lot (an exploit in the linux kernel is a very expensive mistake). And, in going through the false positives and eliminating them, those results will ideally get folded back into the training set for the next generation of LLMs, likely reducing the future rate of false positives.
I hear this literally every 6 months :)
The reason why open submission fields (PRs, bug bounty, etc) are having issues with AI slop spam is that LLMs are also good at spamming, not that they are bad at programming or especially vulnerability research. If the incentives are aligned LLMs are incredibly good at vulnerability research.
According to Willy Tarreau[0] and Greg Kroah-Hartman[1], this trend has recently significantly reversed, at least from the reports they've been seeing on the Linux kernel. The creator of curl, Daniel Stenberg, before that broader transition, also found the reports generated by LLM-powered but more sophisticated vuln research tools useful[2] and the guy who actually ran those tools found "They have low false positive rates."[3]
Additionally, there was no mention in the talk by the guy who found the vuln discussed in the TFA of what the false positive rate was, or that he had to sift through the reports because it was mostly slop — or whether he was doing it out of courtesy. Additionally, he said he found only several hundred, iirc, not "thousands." All he said was:
"I have so many bugs in the Linux kernel that I can’t report because I haven’t validated them yet… I’m not going to send [the Linux kernel maintainers] potential slop, but this means I now have several hundred crashes that they haven’t seen because I haven’t had time to check them." (TFA)
He quite evidently didn't have to sift through thousands, or spend months, to find this one, either.
[0]: https://lwn.net/Articles/1065620/ [1]: https://www.theregister.com/2026/03/26/greg_kroahhartman_ai_... [2]: https://simonwillison.net/2025/Oct/2/curl/p [3]: https://joshua.hu/llm-engineer-review-sast-security-ai-tools...
I wonder if it’s partially to make it easier to validate from an AI perspective
AI tools are great but are being oversold and overhyped by those with an incentive. So, there is a continuous drumbeat of "AI will do all the code for you!", "Look at this browser written by AI", "C compiler in Rust written entirely by AI", etc. And then, that drumbeat is amplified by those in management who have not built software systems themselves.
What happened to the AI generated "C compiler in rust" ? or the browser written by AI ? - they remain a steaming pile of almost-working code. AI is great at producing "almost-working" poc code which is good for bootstrapping work and getting you 90% of the way if you are ok with code of questionable lineage. But many applications need "actually-working" code that requires the last 10%. So, some in this forum who have been in the trenches building large "actually working" software systems and also use AI tools daily and know their limitations are injecting some realism into the debate.
People’s willingness to argue about technology they’ve barely used is always bewildering to me though.
> Now most of these reports are correct, to the point that we had to bring in more maintainers to help us.
It was Opus 4.6 (the model). You could discover this with some other coding agent harness.
The other thing that bugs me, and frankly I don't have the time to try it out myself, is that they did not compare to see if the same bug would have been found with GPT 5.4 or perhaps even an open source model.
Without that, and for the reasons I posted above, while I am sure this is not the intention, the post reads like an ad for claude code.
I don't understand this critique. Carlini did use Claude Code directly. Claude Code used the Claude Opus 4.6 model, but I don't know why you'd consider it inaccurate to say Claude Code found it.
GPT 5.4 might be capable of finding it as well, but the article never made any claims about whether non-Anthropic models could find it.
If I wrote about achieving 10k QPS with a Go server, is the article misleading unless I enumerate every other technology that could have achieved the same thing?
And surely that would be relevant if they were using a different harness.
No, the problem is sorting out thousands of false positives from Claude Code's reports. Having only 5 out of 1000+ reports be valid is statistically worse than running a fuzzer on the codebase.
Just sayin'
Carlini said "hundreds" of crashes, not 1000+.
It's not that only 5 were true positives and the rest were false positives. 5 were true positives and Carlini doesn't have bandwidth to review the rest. Presumably he's reviewed more than 5 and some were not worth reporting, but we don't know what that number is. It's almost certainly not hundreds.
Keep in mind that Carlini's not a dedicated security engineer for Linux. He's seeing what's possible with LLMs and his team is simultaneously exploring the Linux kernel, Firefox,[0] GhostScript, OpenSC,[1] and probably lots of others that they can't disclose because they're not yet fixed.
I stand corrected.
Instead, people seem to be infatuated with vibe coding technical debt at scale.
Yea, that is what I have been saying as well...
>Instead, people seem to be infatuated with vibe coding technical debt at scale.
Don't blame them. That is what AI marketing pushes. And people are sheep to marketing...
I understand why AI companies don't want to promote it. Because they understand that the LCD/majority of their client base won't see code review as a critical part of their business. If LLMs are marketed as best suited for code review, then they probably cannot justify the investments that they are getting...
It will hurt some CC-heavy users' feelings but that's a different thing.
Did it ever occur to you that for whatever reason you just might not be cut out for the software treadmill?