0 points · anotherpaulg · 1y ago

The new Sonnet tops aider's code editing leaderboard at 84.2%. Using aider's "architect" mode it sets the SOTA at 85.7% (with DeepSeek as the "editor" model).

  84% Claude 3.5 Sonnet 10/22
  80% o1-preview
  77% Claude 3.5 Sonnet 06/20
  72% DeepSeek V2.5
  72% GPT-4o 08/06
  71% o1-mini
  68% Claude 3 Opus

It also sets SOTA on aider's more demanding refactoring benchmark with a score of 92.1%!

  92% Sonnet 10/22
  75% o1-preview
  72% Opus
  64% Sonnet 06/20
  49% GPT-4o 08/06
  45% o1-mini

https://aider.chat/docs/leaderboards/

18 comments · 6 top-level

gloosx1y ago· 5 in thread

I will repeat my question from one of the previous threads:

Can someone explain these Aider benchmarks to me? They pass the same 133 tests through the LLM every time. Why do they then extrapolate the LLM's ability to pass these 133 basic Python challenges to a general ability to produce/edit code? Couldn't LLM provider just fine-tune their model for these tasks specifically - since they are static - to get ad value?

Did anyone ever try to change the test cases or wiggle the conditions a bit to see if it still hits the same %?

carschno1y ago

Indeed, test data like this constantly leaks into the training data, so these leaderboards are not necessarily representative of real-world problems. A better approach is to use variable evaluation like GSM-Symbolic (for evaluating mathematical reasoning): https://arxiv.org/abs/2410.05229
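A minimal sketch of what "variable evaluation" could look like, assuming a toy word-problem template rather than anything taken from the GSM-Symbolic paper or from aider's benchmark. The names `make_variant`, `score`, `parsing_solver`, and `memorizing_solver` are hypothetical. The two solvers stand in for a model that actually works the problem and one that has memorized a single leaked answer.

```python
import random
import re


def make_variant(rng):
    """Build one randomized instance of a fixed problem template."""
    apples = rng.randint(2, 50)
    eaten = rng.randint(1, apples - 1)
    name = rng.choice(["Ava", "Ben", "Chen", "Dana"])
    prompt = f"{name} has {apples} apples and eats {eaten}. How many are left?"
    return prompt, apples - eaten


def score(solver, n_variants=100, seed=0):
    """Fraction of freshly generated variants the solver answers correctly."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_variants):
        prompt, expected = make_variant(rng)
        if solver(prompt) == expected:
            correct += 1
    return correct / n_variants


def parsing_solver(prompt):
    """Stand-in for a model that reads the numbers and computes the answer."""
    first, second = (int(n) for n in re.findall(r"\d+", prompt))
    return first - second


def memorizing_solver(prompt):
    """Stand-in for a model that memorized the answer to one static instance."""
    return 7
```

Because every run draws fresh numbers and names, the solver that memorized one fixed answer scores poorly, while the one that computes the answer is unaffected.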

jo9091y ago

> Couldn't LLM provider just fine-tune their model for these tasks specifically - since they are static - to get ad value?

They could. They would easily be found out as they lose in real-world usage or on new, improved, unique benchmarks.

If you were in charge of a large and well funded model, would you rather pay people to find and "cheat" on LLM benchmarks by training on them, or would you pay people to identify benchmarks and make reasonably sure they specifically get excluded from training data?

I would exclude them as well as possible so I get feedback on how "real" any model improvement is. I need to develop real world improvements in the end, and any short term gain in usage by cheating in benchmarks seems very foolish.

5 more replies

bilekas1y ago

> Couldn't LLM provider just fine-tune their model for these tasks specifically - since they are static - to get ad value?

Yes, this is an inherent problem with the whole idea of LLMs. They're pattern recognition "students", but the important thing that all the providers like to sell is their reasoning. A good test is a reasoning test. I'll try to find a link and update with a reference.

Lucasoato1y ago

There is an opportunity to develop black-box benchmarks and offer them to LLM providers to support their testing phase. If I were in their place, I would find it incredibly valuable to have such tamper-proof testing before releasing a model.

gloosx1y ago

Conveniently, the author of these benchmarks remains silent on the topic every time. Think about it :)

faizshah1y ago· 3 in thread

Anecdotally, I still get significantly better results from ChatGPT than Claude for coding.

Claude is way less controllable; it is difficult to get it to do exactly what I want. ChatGPT is way easier to control in terms of asking for specific changes.

Not sure why that is; maybe the chain of thought and instruction-tuning dataset have made theirs a lot better for interactive use.

anonzzzies1y ago

For me it's the opposite; ChatGPT (o1-preview and 4o) keeps making very strange errors, errors that I even tell it exactly how to fix, and it simply repeats the same fundamental mistakes again. With Claude, I did not have that.

Example: I asked it to write some JS that finds a button on a page, clicks the button, then waits for a new element with some selector to appear and returns a ref to it. ChatGPT kept returning (pseudocode):

  while (true) {
    button.click()
    wait()
    oldItems = ...
    newItems = ...
    newItem = newItems - oldItems
    if (newItem) return newItem
    sleep(1)
  }

which obviously doesn't work. Claude understands to put the oldItems outside the while; even when I tell chatgpt to do that, it doesn't. Or it does one time and with another change, it moves it back in.
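For comparison, a minimal sketch of the fix being described: snapshot the old items once, before the click and outside the loop, then poll. It is written in Python rather than JS, and `click_and_wait_for_new`, `click`, and `list_items` are hypothetical stand-ins for the page interaction, not a real browser API.

```python
import time


def click_and_wait_for_new(click, list_items, timeout=5.0, interval=0.1):
    """Click once, then poll until an item appears that was not there before.

    `click` and `list_items` are caller-supplied callables (assumptions for
    this sketch): one triggers the button, the other returns the identifiers
    of the elements currently matching the selector.
    """
    # Snapshot BEFORE clicking and OUTSIDE the loop. Retaking it on every
    # iteration, as in the pseudocode above, would absorb the new item into
    # the "old" set and leave the difference empty.
    old_items = set(list_items())
    click()

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        new_items = set(list_items()) - old_items
        if new_items:
            return next(iter(new_items))
        time.sleep(interval)
    raise TimeoutError("no new item appeared before the timeout")
```

The button is also clicked only once here; clicking inside the loop would keep re-triggering whatever the button does.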

bigs1y ago

Try as I might, ChatGPT couldn’t give me working code for a simple admin dash layout in Vue with a sidebar that can minimise. I had to correct it; it would say "my apologies" and provide new code with a different error. About 10 times in a row it got into a loop of errors, and I gave up.

Do any of these actually help coding?

3 more replies

csomar1y ago

Maybe it's relative? Claude beats GPT-4/o by a wide margin for me, but I am mostly using them for Rust.

1 more reply

ianeigorndua1y ago· 2 in thread

Are these synthetic or real-world benchmarks?

Answering myself: “Aider’s code editing benchmark asks the LLM to edit python source files to complete 133 small coding exercises from Exercism”

Not gonna start looking for a job any time soon

zeroonetwothree1y ago

Example I chose at random:

> Convert a hexadecimal number, represented as a string (e.g. "10af8c"), to its decimal equivalent using first principles (i.e. no, you may not use built-in or external libraries to accomplish the conversion).

So it's fairly synthetic. It's also the sort of thing LLMs should be great at since I'm sure there's tons of data on this sort of thing online.
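For a sense of scale, a minimal sketch of what a solution to that exercise might look like, assuming invalid input should raise `ValueError` (the exercise's actual requirements for bad input are not quoted above). The function name `hex_to_decimal` is hypothetical.

```python
def hex_to_decimal(text):
    """Convert a hexadecimal string to an int without int(text, 16)."""
    digits = "0123456789abcdef"
    value = 0
    for ch in text.lower():
        if ch not in digits:
            # Assumption: reject anything that is not a hex digit.
            raise ValueError(f"invalid hexadecimal digit: {ch!r}")
        # Shift the running value one hex place left, then add the new digit.
        value = value * 16 + digits.index(ch)
    return value
```

For the example string in the quote, `hex_to_decimal("10af8c")` gives 1093516.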

1 more reply

stavros1y ago

I use Claude for coding and it's fantastic. I definitely have outsourced a lot of my coding to it.

3 more replies

miki1232111y ago· 1 in thread

When using these models via the official Anthropic API, do I have to do anything to "opt in" to the new Sonnet, or am I switched over automatically?

simonw1y ago

That depends on the model ID you are using.

If you use "claude-3-5-sonnet-latest" you'll be upgraded to "claude-3-5-sonnet-20241022" already - I tested that this morning.

If you're on "claude-3-5-sonnet-20240620" you'll need to change that ID to either the -latest one or the -20241022 one.
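The rule above can be sketched as a small lookup. This is only an illustration of the behaviour reported in the comment at the time it was written: the `ALIASES` table and both function names are assumptions, nothing here calls the Anthropic API, and the alias may point at a different snapshot later.

```python
# Alias -> dated snapshot, as reported in the comment above. Hypothetical
# local table for illustration only; not fetched from any API.
ALIASES = {"claude-3-5-sonnet-latest": "claude-3-5-sonnet-20241022"}

NEW_SONNET = "claude-3-5-sonnet-20241022"


def resolve_model_id(model_id: str) -> str:
    """Return the dated snapshot a model ID refers to under the table above."""
    return ALIASES.get(model_id, model_id)


def gets_new_sonnet(model_id: str) -> bool:
    """True if requests with this ID would reach the 10/22 Sonnet."""
    return resolve_model_id(model_id) == NEW_SONNET
```

Under these assumptions, `gets_new_sonnet("claude-3-5-sonnet-20240620")` is false, which matches the advice to change that ID explicitly.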

usaar3331y ago· 1 in thread

FWIW, the refactor benchmark is quite mechanical - it just stresses reliability of LLMs over long context windows:

Questions are variants of:

> Refactor the _set_csrf_cookie method in the CsrfViewMiddleware class to be a stand alone, top level function. Name the new function _set_csrf_cookie, exactly the same name as the existing method. Update any existing self._set_csrf_cookie calls to work with the new _set_csrf_cookie function.
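A toy illustration of the shape of that refactor, assuming the method does not depend on instance state. The class body below is a hypothetical stand-in using plain dicts, not Django's actual `CsrfViewMiddleware` code.

```python
# Before the refactor, the class would have looked roughly like this:
#
#     class CsrfViewMiddleware:
#         def _set_csrf_cookie(self, request, response):
#             response["cookies"]["csrftoken"] = request["token"]
#
#         def process_response(self, request, response):
#             self._set_csrf_cookie(request, response)
#             return response


def _set_csrf_cookie(request, response):
    """Stand-alone, top-level function with the same name as the old method."""
    response["cookies"]["csrftoken"] = request["token"]


class CsrfViewMiddleware:
    def process_response(self, request, response):
        # The former self._set_csrf_cookie(...) call now targets the function.
        _set_csrf_cookie(request, response)
        return response
```

The change itself is mechanical; the difficulty in the benchmark presumably comes from doing it reliably inside a large file, where every call site has to be found and updated.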

jstummbillig1y ago

Assuming that that is indeed what most of the benchmark does: if the LLMs are as bad at it as the numbers suggest, then it seems like a perfectly good benchmark. I would definitely want them to be able to do stuff like that when I let them write my code.

artemisart1y ago

Thanks! I was waiting for your benchmarks. Do you plan to test Haiku 3.5 too? It would also be nice to show the API cost of running the whole benchmark, to give a better idea of how many internal tokens the o1 models consume.
