The original comment was this:
> So there's no ground truth; they're just benchmarking how impressive an LLM's code review sounds to a different LLM. Hard to tell what to make of that.
The comment I replied to was:
> That's how 99% of 'LLM benchmark numbers' circulating on the internet work.
And that's just false. SWE-bench Verified isn't like this. Aider Polyglot isn't like this. SWE-Lancer Diamond isn't like this. The new internal benchmarks OpenAI used in GPT-5's model card aren't like this.
Maybe this benchmark is a special snowflake and genuinely needs LLM-as-a-judge, but that doesn't invalidate the original concern: setting up a benchmark this way runs into a series of problems and is prone to showing performance differences that might not be there with a different setup. Benchmarks are already hard to trust; I'm not sure why this one should be any more indicative than the rest.
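
To make the concern concrete, here's a toy sketch of the difference between test-verified scoring and judge-based scoring. Everything in it (the scoring functions, the stub judge) is hypothetical illustration, not any real benchmark's harness:

```python
import pathlib
import subprocess
import sys
import tempfile

def score_with_ground_truth(candidate_code: str, test_code: str) -> float:
    """SWE-bench-Verified-style scoring: run hidden tests against the
    candidate; the score is pass/fail, regardless of how the code reads."""
    with tempfile.TemporaryDirectory() as d:
        path = pathlib.Path(d) / "solution.py"
        path.write_text(candidate_code + "\n" + test_code)
        proc = subprocess.run([sys.executable, str(path)], capture_output=True)
        return 1.0 if proc.returncode == 0 else 0.0

def score_with_llm_judge(candidate_review: str, judge) -> float:
    """LLM-as-a-judge scoring: a second model rates how convincing the
    output sounds; nothing is executed, so there is no ground truth."""
    prompt = f"Rate this code review 0-10 for usefulness:\n\n{candidate_review}"
    return float(judge(prompt)) / 10.0

if __name__ == "__main__":
    # Ground truth catches the bug no matter how plausible the code looks.
    candidate = "def add(a, b):\n    return a - b  # buggy"
    tests = "assert add(2, 3) == 5"
    print(score_with_ground_truth(candidate, tests))  # 0.0

    # Stub standing in for a judge model: a persuasive-sounding but wrong
    # review can still score highly, which is exactly the worry above.
    stub_judge = lambda prompt: 9
    print(score_with_llm_judge("Looks thorough and well reasoned.", stub_judge))  # 0.9
```

The point isn't that judge-based scoring is useless, just that its failure mode (rewarding outputs that merely sound good) is the thing the original comment was flagging.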