Qodo CLI agent scores 71.2% on SWE-bench Verified (opens in new tab)

(qodo.ai)

139 pointsbobismyuncle9mo ago54 comments

54 comments

I’ve been running a bunch of coding agents on benchmarks recently as part of consulting, and this is actually much more impressive than it seems at first glance.

71.2% puts it at 5th, which is 4 points below the leader (four points is a lot) and just over 1% lower than Anthropic’s own submission for Claude Sonnet 4 - the same model these guys are running.

But the top rated submissions aren’t running production products. They generally have extensive scaffolding or harnesses that were built *specifically for SWE bench*, which kind of defeats the whole purpose of the benchmark.

Take for example Refact which is at #2 with 74.4%, they built a 2k lines of code framework around their agent specifically for SWE bench (https://github.com/smallcloudai/refact-bench/). It’s pretty elaborate, orchestrating multiple agents, with a debug agent that kicks in if the main agent fails. The debug agent analyzes the failure and gives insights to the main agent which tries again, so it’s effectively multiple attempts per problem.

If the results can be reproduced “out-of-the-box” with their coding agent like they claim, it puts it up there as one of the top 2-3 CLI agents available right now.

thinkingtoilet9mo ago

This is classic Goodhart's law. "When a measure becomes a target, it ceases to be a good measure"

https://en.wikipedia.org/wiki/Goodhart%27s_law

ambicapter9mo ago

It's really not that hard to not build a custom bench setup to game the benchmark instead of just using your product straight out of the box, though.

VikingCoder9mo ago

Right, other than financial pressure. Which is, of course, immense.

jasonjmcghee9mo ago

Right. Building a custom setup is blatant- that will wildly overfit.

But let's say a group uses it as a metric as part of CI and each new idea / feature they create runs against SWE bench. Maybe they have parameterized bits and pieces they adjust, maybe they have multiple candidates datasets for fine tuning, maybe they're choosing between checkpoints.

This will also end up overfitting - especially if done habitually. It might be a great metric and result in a more powerful overall model. Or it might not.

clutchdude9mo ago

Also see the VW dieselgate and numerous other "gaming the system" examples.

kelipso9mo ago

A specific setup for the benchmark is just plain cheating, not Goodhart’s law.

energy1239mo ago

What are the typical context lengths in SWE-bench problems? Does it partly measure performance in the 64-128k context range?

whymauri9mo ago

This is what the rows look like:

https://huggingface.co/datasets/princeton-nlp/SWE-bench_Veri...

Its up to your retrieval system/model to selectively hunt for relevant context. Here's a few critiques of the benchy:

https://x.com/brhydon/status/1953648884309536958

dimitri-vs9mo ago

IIRC the SWE bench dataset gives you the full repo snapshot + the issue text, the evaluation pipelines typically run some kind of retriever (eg. grep, BM25) to pick a subset of files to place in the model’s context. They provided context is usually limited up to ~50k tokens.

terminalshort9mo ago

Is there something in this multi-agent approach that makes the setup more specific to just the test at hand and less general to real engineering tasks? If not, then this multi-agent system will just become what you get out of the box in a future product. Multiple attempts per problem (as long as there's no human intervention or selection between them) is a perfectly fine approach for agents because that's not an issue from the perspective of an engineer using the product. A single agent is already a multi-step usage of LLMs and it sounds like this is just another meta level of that.

oblio9mo ago

https://github.com/auchenberg/volkswagen

eddd-ddde9mo ago

I think multiple attempts are completely understandable and even expected? How is that defeating the purpose of the benchmark?

gronky_9mo ago

It’s a pass@1 benchmark. When submitting you need to check a box that there was only 1 attempt per problem. See here for example: https://github.com/SWE-bench/experiments/pull/219

Building multiple attempts into your agent is stretching the rules, even if technically it’s acceptable

terminalshort9mo ago

From my perspective as a potential user the number of attempts is the number of times I have to tell it what to do. If you have an agent that makes a single attempt and is 60% accurate vs another that makes 5 attempts and is 80% accurate, why would you care that each individual attempt of the 2nd model is less accurate than the first?

5 more replies

szundi9mo ago

According to your experience with this model, is it just trained for the benchmark or these points are actually representing the performance?

1 more reply

Roritharr9mo ago

Finally someone mentions Refact, I was in contact with the team, rooting for them really.

bluelightning2k9mo ago

Just looked them up. Their pricing is around buying "coins" with no transparency as to what that gets. Hard pass

Roritharr9mo ago

You realize that you can self-host their stuff? https://github.com/smallcloudai/refact

ai-christianson9mo ago

One thing with SWE bench is making sure there's zero leakage of information into the LLM context.

I.e. the agent cannot even know which tests are failing.

It has to both fix the issue based just on the issue text and fix it in the specific way the unit test, which it cannot see, expects.

For this reason I find the benchmark a little disconnected from the reality of software engineering.

khalic9mo ago

We need some international body to start running these tests… I just can’t trust these numbers any longer. We need a platform for this, something at least we can get some peer reviews

redman259mo ago

That sounds like an interesting idea to me. It would at least resolve the problem of companies gaming the metric.

Another approach might be the LiveBench approach where new tests are released on a regular basis.

jcorco9mo ago

I’m working on this at STAC Research and looking to connect with others interested in helping. Key challenges are ensuring impartiality (and keeping it that way), making benchmarks ungameable, and guaranteeing reproducibility. We’ve done similar work in finance and are now applying the same principles to AI.

khalic8mo ago

That sounds amazing, mind telling us a little more?

jcorco8mo ago

Sure! STAC Research has been building and running benchmarks in finance for ~18 years. We’ve had to solve many of the same problems I think you’re highlighting here.. e.g. tech & model providers tuning specifically for the benchmark, results that get published but can’t be reproduced outside the provider’s lab, etc.

The approach is to use workloads defined by developers and end users (not providers) that reflect their real-world tasks. E.g. in finance, delivering market snapshots to trading engines. We test full stacks, holding some layers constant so you can isolate the effect of hardware, software, or models. Every run goes through an independent third-party audit to ensure consistent conditions, no cherry-picking of results, and full disclosure of config and tuning, so that the results are reproducible and the comparisons are fair.

In finance, the benchmarks are trusted enough to drive major infrastructure decisions by the leading banks and hedge funds, and in some cases to inform regulatory discussions, e.g. around how the industry handles time synchronization.

Now starting to apply the same principles to the AI benchmarking space. Would love to talk to anyone who wants to be involved?

1 more reply

mupuff12349mo ago

I'm curious how do these LLM wrapper companies think they'll survive long term - especially coding related wrappers.

I could understand focusing on a niche business use case, but coding is a main focus of the foundation models themselves.

M4R5H4LL9mo ago

Labeling them as “wrappers” and “niche business” indicates a strong cognitive bias already. Value can be created on both sides of the equation.

dgfitz9mo ago

How so? They are wrappers, and it is niche.

riku_iki8mo ago

I think those wrappers could create some potentially complex workflow around LLM API, with various trees of decisions, integrations, eval, rankers, ratets, etc, and this is their added value.

choilive9mo ago

Wrappers are a bit pejorative and reductive - everything is a wrapper around something else.

1 more reply

orangebread9mo ago

I've been using Warp for the past few weeks and it's been incredibly impressive over other agentic coding services/platforms. Curious how Qodo stacks up.

lightbendover9mo ago

When I tried warp I was convinced that was where the industry was going (agents as terminal replacement), but it felt a bit too heavy to me so I haven’t been using it lately. Still think all things will converge on terminal and browser replacement.

itamarcode9mo ago

Unlike most SWE bench submissions, Qodo Command one uses the product directly.

I think that the next step is getting an official "checked" mark by the SWE bench team

whymauri9mo ago

I feel like the bash only SWE Bench Verified (a.k.a model + mini-swe-agent) is the closest thing to measuring the inherent ability of the model vs. the scaffolding.

https://github.com/SWE-agent/mini-swe-agent

NitpickLawyer9mo ago

There's swe-rebench, where they take "bugs/issues" by date, and you can drag a slider on their top scores to see issues solved after the model was released (obviously only truly working for open models).

raylad9mo ago

If it's really better than Claude Code while using Sonnet 4.0, then I'd pay a monthly fee for it, but only if I can use my Claude subscription the same way Claude Code does.

I do not want to pay API charges or be limited to a fixed number of "credits" per month.

esafak9mo ago

If Qodo is reading: please compare your efficiency too. Run some tasks on various agents using the same models, and report the cost.

lirantal9mo ago

Slick. This applies to the new Qodo Command CLI, yes?

I updated to the latest version last night. Enjoyed seeing the process permission toggle (rwx). Was a refreshing change to keep the security minded folks less in panic with all the agentic coding adoptions :-)

zuzuen_19mo ago

I would be more interested in Qodo's performance on the swe-bench-multilingual benchmark. Swe-bench-verified only includes bugs related to python breakages.

The best submission is swe-bench-multilingual is Claude 3.7 Sonnet which solves ~43% of the issues in the dataset.

zuzuen_19mo ago

Does anyone have a benchmark on the effectiveness of using embeddings for mapping bug reports to code files as opposed to extensive grepping as Qodo, Cursor and a number of tools I use do to localize faults?

afro889mo ago

If Qodo are reading this: please introduce a plan that isn't for teams or enterprise. A "pro" plan for individuals who want more than 250 credits per month.

OldGreenYodaGPT9mo ago

Was using their bot for code review for last 2 years but just dropped it for BugBot

OldfieldFund9mo ago

do we know anything about the size of the model? I can't find the answer.

khalic9mo ago

it's sonnet behind the scene

rs1869mo ago

So this is from the same company that wrote a blog post with sentences that don't even make sense:

https://news.ycombinator.com/item?id=44833929, my comment https://news.ycombinator.com/item?id=44835939

j / k navigate · click thread line to collapse

54 comments

gronky_9mo ago

I’ve been running a bunch of coding agents on benchmarks recently as part of consulting, and this is actually much more impressive than it seems at first glance.

71.2% puts it at 5th, which is 4 points below the leader (four points is a lot) and just over 1% lower than Anthropic’s own submission for Claude Sonnet 4 - the same model these guys are running.

If the results can be reproduced “out-of-the-box” with their coding agent like they claim, it puts it up there as one of the top 2-3 CLI agents available right now.

thinkingtoilet9mo ago

This is classic Goodhart's law. "When a measure becomes a target, it ceases to be a good measure"

https://en.wikipedia.org/wiki/Goodhart%27s_law

ambicapter9mo ago

It's really not that hard to not build a custom bench setup to game the benchmark instead of just using your product straight out of the box, though.

VikingCoder9mo ago

Right, other than financial pressure. Which is, of course, immense.

jasonjmcghee9mo ago

Right. Building a custom setup is blatant- that will wildly overfit.

This will also end up overfitting - especially if done habitually. It might be a great metric and result in a more powerful overall model. Or it might not.

clutchdude9mo ago

Also see the VW dieselgate and numerous other "gaming the system" examples.

kelipso9mo ago

A specific setup for the benchmark is just plain cheating, not Goodhart’s law.

energy1239mo ago

What are the typical context lengths in SWE-bench problems? Does it partly measure performance in the 64-128k context range?

whymauri9mo ago

This is what the rows look like:

https://huggingface.co/datasets/princeton-nlp/SWE-bench_Veri...

Its up to your retrieval system/model to selectively hunt for relevant context. Here's a few critiques of the benchy:

https://x.com/brhydon/status/1953648884309536958

dimitri-vs9mo ago

terminalshort9mo ago

oblio9mo ago

https://github.com/auchenberg/volkswagen

eddd-ddde9mo ago

I think multiple attempts are completely understandable and even expected? How is that defeating the purpose of the benchmark?

gronky_9mo ago

It’s a pass@1 benchmark. When submitting you need to check a box that there was only 1 attempt per problem. See here for example: https://github.com/SWE-bench/experiments/pull/219

Building multiple attempts into your agent is stretching the rules, even if technically it’s acceptable

terminalshort9mo ago

5 more replies

szundi9mo ago

According to your experience with this model, is it just trained for the benchmark or these points are actually representing the performance?

1 more reply

Roritharr9mo ago

Finally someone mentions Refact, I was in contact with the team, rooting for them really.

bluelightning2k9mo ago

Just looked them up. Their pricing is around buying "coins" with no transparency as to what that gets. Hard pass

Roritharr9mo ago

You realize that you can self-host their stuff? https://github.com/smallcloudai/refact

ai-christianson9mo ago

One thing with SWE bench is making sure there's zero leakage of information into the LLM context.

I.e. the agent cannot even know which tests are failing.

It has to both fix the issue based just on the issue text and fix it in the specific way the unit test, which it cannot see, expects.

For this reason I find the benchmark a little disconnected from the reality of software engineering.

khalic9mo ago

We need some international body to start running these tests… I just can’t trust these numbers any longer. We need a platform for this, something at least we can get some peer reviews

redman259mo ago

That sounds like an interesting idea to me. It would at least resolve the problem of companies gaming the metric.

Another approach might be the LiveBench approach where new tests are released on a regular basis.

jcorco9mo ago

khalic8mo ago

That sounds amazing, mind telling us a little more?

jcorco8mo ago

Now starting to apply the same principles to the AI benchmarking space. Would love to talk to anyone who wants to be involved?

1 more reply

mupuff12349mo ago

I'm curious how do these LLM wrapper companies think they'll survive long term - especially coding related wrappers.

I could understand focusing on a niche business use case, but coding is a main focus of the foundation models themselves.

M4R5H4LL9mo ago

Labeling them as “wrappers” and “niche business” indicates a strong cognitive bias already. Value can be created on both sides of the equation.

dgfitz9mo ago

How so? They are wrappers, and it is niche.

riku_iki8mo ago

I think those wrappers could create some potentially complex workflow around LLM API, with various trees of decisions, integrations, eval, rankers, ratets, etc, and this is their added value.

choilive9mo ago

Wrappers are a bit pejorative and reductive - everything is a wrapper around something else.

1 more reply

orangebread9mo ago

I've been using Warp for the past few weeks and it's been incredibly impressive over other agentic coding services/platforms. Curious how Qodo stacks up.

lightbendover9mo ago

itamarcode9mo ago

Unlike most SWE bench submissions, Qodo Command one uses the product directly.

I think that the next step is getting an official "checked" mark by the SWE bench team

whymauri9mo ago

I feel like the bash only SWE Bench Verified (a.k.a model + mini-swe-agent) is the closest thing to measuring the inherent ability of the model vs. the scaffolding.

https://github.com/SWE-agent/mini-swe-agent

NitpickLawyer9mo ago

raylad9mo ago

If it's really better than Claude Code while using Sonnet 4.0, then I'd pay a monthly fee for it, but only if I can use my Claude subscription the same way Claude Code does.

I do not want to pay API charges or be limited to a fixed number of "credits" per month.

esafak9mo ago

If Qodo is reading: please compare your efficiency too. Run some tasks on various agents using the same models, and report the cost.

lirantal9mo ago

Slick. This applies to the new Qodo Command CLI, yes?

zuzuen_19mo ago

I would be more interested in Qodo's performance on the swe-bench-multilingual benchmark. Swe-bench-verified only includes bugs related to python breakages.

The best submission is swe-bench-multilingual is Claude 3.7 Sonnet which solves ~43% of the issues in the dataset.

zuzuen_19mo ago

afro889mo ago

If Qodo are reading this: please introduce a plan that isn't for teams or enterprise. A "pro" plan for individuals who want more than 250 credits per month.

OldGreenYodaGPT9mo ago

Was using their bot for code review for last 2 years but just dropped it for BugBot

OldfieldFund9mo ago

do we know anything about the size of the model? I can't find the answer.

khalic9mo ago

it's sonnet behind the scene

rs1869mo ago

So this is from the same company that wrote a blog post with sentences that don't even make sense:

https://news.ycombinator.com/item?id=44833929, my comment https://news.ycombinator.com/item?id=44835939

j / k navigate · click thread line to collapse