Many tools for doing security audits with LLMs are built around a "look at this file" loop, where each file gets analyzed individually. As I noted in the post, I don't consider that a hint at all: in a real security audit, the model would have gotten the prompt "look at this file for security issues".
It's probably also how Mythos was used for auditing when it found these bugs. At least a couple of folks at Anthropic have discussed using a loop like that to find security bugs; that was the inspiration for Nelson, which is what this benchmark project sprang out of.
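For anyone who hasn't seen one, here's a rough sketch of what I mean by a "look at this file" loop. It is not Nelson's actual code or anything Anthropic has published: `audit_per_file`, `ask_model`, and the suffix filter are all made-up names for illustration, and `ask_model` stands in for whatever agent call you'd really use (one that can open other files with its own tools).

```python
from pathlib import Path
from typing import Callable

# Prompt wording taken from the comment above; everything else is hypothetical.
PROMPT = "Look at this file for security issues: {path}"


def audit_per_file(
    repo_root: str,
    ask_model: Callable[[str], list[str]],
    suffixes: tuple[str, ...] = (".c", ".h"),
) -> dict[str, list[str]]:
    """Run one independent model call per source file and collect the findings.

    `ask_model` is an assumed callable: it takes a prompt and returns a list
    of finding descriptions. Each call starts fresh, so the model only sees
    other files if it chooses to go and read them.
    """
    root = Path(repo_root)
    findings: dict[str, list[str]] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            rel = path.relative_to(root).as_posix()
            findings[rel] = ask_model(PROMPT.format(path=rel))
    return findings
```

The point is just that every file gets the same generic prompt, so naming the file doesn't tell the model where the bug is.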
Nonetheless, I'm currently running benchmarks of "look at this repo, find any security bugs", because I suspect the really good models will be able to spot some of the hard bugs that span multiple files. (The models always have the tools to look at other files, but maybe didn't take the time to fully comprehend the full source before trying to find security issues.) Those runs will take a lot longer and cost a lot more. There will also be a lot more noise in that benchmark, as it'll probably find dozens of real bugs of varying severity plus more false positives, all of which have to be judged.