84% Claude 3.5 Sonnet 10/22
80% o1-preview
77% Claude 3.5 Sonnet 06/20
72% DeepSeek V2.5
72% GPT-4o 08/06
71% o1-mini
68% Claude 3 Opus
It also sets SOTA on aider's more demanding refactoring benchmark with a score of 92.1%!
92% Sonnet 10/22
75% o1-preview
72% Opus
64% Sonnet 06/20
49% GPT-4o 08/06
45% o1-mini
https://aider.chat/docs/leaderboards/
Can someone explain these Aider benchmarks to me? They pass the same 133 tests through the LLM every time. Why do they then extrapolate from an LLM's ability to pass these 133 basic Python challenges to its general ability to produce or edit code? Couldn't an LLM provider just fine-tune its model on these tasks specifically, since they are static, to get ad value?
Did anyone ever try changing the test cases or wiggling the conditions a bit to see whether it still hits the same percentage?
They could. They would easily be found out as they lose in real-world usage or on improved, new, unique benchmarks.
If you were in charge of a large and well funded model, would you rather pay people to find and "cheat" on LLM benchmarks by training on them, or would you pay people to identify benchmarks and make reasonably sure they specifically get excluded from training data?
I would exclude them as well as possible so I get feedback on how "real" any model improvement is. I need to develop real world improvements in the end, and any short term gain in usage by cheating in benchmarks seems very foolish.
Yes, this is an inherent problem with the whole idea of LLMs. They're pattern-recognition "students", but the important thing that all the providers like to sell is their reasoning. A good test is a reasoning test. I'll try to find a link and update with a reference.
Claude is way less controllable; it is difficult to get it to do exactly what I want. ChatGPT is way easier to control in terms of asking for specific changes.
Not sure why that is; maybe the chain of thought and instruction-tuning dataset have made theirs a lot better for interactive use.
Example: I asked it to write some JS that finds a button on a page, clicks the button, then waits for a new element matching some selector to appear and returns a ref to it. ChatGPT kept returning (pseudocode):
  while (true) {
    button.click()
    wait()
    oldItems = ...
    newItems = ...
    newItem = newItems - oldItems
    if (newItem) return newItem
    sleep(1)
  }
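which obviously doesn't work: oldItems is re-read after the click on every pass, so the difference is always empty, and the button gets clicked again each time. Claude understands that oldItems has to go outside the while; even when I tell ChatGPT to do that, it doesn't. Or it does it once and, with another change, moves it back in.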
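For what it's worth, here is a rough sketch of the ordering I was after. It's in Python rather than JS so it runs without a browser, and `click` and `query_items` are hypothetical stand-ins for "click the button" and "list the elements matching the selector", not any real DOM or automation API. The point is just that the snapshot and the click happen once, before the loop.

```python
import time
from typing import Callable, Hashable, Iterable, Optional


def click_and_wait_for_new(
    click: Callable[[], None],
    query_items: Callable[[], Iterable[Hashable]],
    timeout: float = 5.0,
    interval: float = 0.1,
) -> Optional[Hashable]:
    """Click once, then poll until an item appears that was not there before."""
    # Snapshot the existing items ONCE, before clicking and outside the loop.
    old_items = set(query_items())

    # Click exactly once (hypothetical callable standing in for button.click()).
    click()

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # Only the "new" side is re-read on each pass.
        new_items = [item for item in query_items() if item not in old_items]
        if new_items:
            return new_items[0]
        time.sleep(interval)

    # Nothing new showed up within the timeout.
    return None
```

The timeout is my addition so the loop can't spin forever; the original pseudocode had no exit condition other than finding the item.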
Do any of these actually help coding?
Answering myself: “Aider’s code editing benchmark asks the LLM to edit python source files to complete 133 small coding exercises from Exercism”
Not gonna start looking for a job any time soon
> Convert a hexadecimal number, represented as a string (e.g. "10af8c"), to its decimal equivalent using first principles (i.e. no, you may not use built-in or external libraries to accomplish the conversion).
So it's fairly synthetic. It's also the sort of thing LLMs should be great at since I'm sure there's tons of data on this sort of thing online.
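To give a sense of scale, a solution to the hex exercise quoted above might look roughly like this. This is my own minimal sketch, not the Exercism reference solution; the function name and the choice to raise `ValueError` on bad input are assumptions.

```python
HEX_DIGITS = "0123456789abcdef"


def hex_to_decimal(text: str) -> int:
    """Convert a hexadecimal string to an int without int(text, 16)."""
    if not text:
        raise ValueError("empty string is not a hexadecimal number")

    value = 0
    for char in text.lower():
        # The position of the character in HEX_DIGITS is its numeric value.
        digit = HEX_DIGITS.find(char)
        if digit == -1:
            raise ValueError(f"invalid hexadecimal digit: {char!r}")
        # Shift the running total one hex place left, then add the new digit.
        value = value * 16 + digit
    return value
```

With the string from the prompt, `hex_to_decimal("10af8c")` gives 1093516, i.e. a handful of lines and one loop.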
If you use "claude-3-5-sonnet-latest" you'll be upgraded to "claude-3-5-sonnet-20241022" already - I tested that this morning.
If you're on "claude-3-5-sonnet-20240620" you'll need to change that ID to either the -latest one or the -20241022 one.
Questions are variants of:
Refactor the _set_csrf_cookie method in the CsrfViewMiddleware class to be a stand alone, top level function. Name the new function _set_csrf_cookie, exactly the same name as the existing method. Update any existing self._set_csrf_cookie calls to work with the new _set_csrf_cookie function.
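In case the shape of that task isn't obvious, here is a toy before/after. This is a heavily simplified stand-in, not Django's actual CsrfViewMiddleware: the dict-based request/response objects and the `process_response` body are made up purely to show the method-to-function move.

```python
# Before: the helper is a method and is called through self.
class CsrfViewMiddlewareBefore:
    def process_response(self, request, response):
        self._set_csrf_cookie(request, response)
        return response

    def _set_csrf_cookie(self, request, response):
        # Hypothetical body: copy the token into the response cookies.
        response["cookies"]["csrftoken"] = request["csrf_token"]


# After: same name, but now a stand-alone top-level function with no self.
def _set_csrf_cookie(request, response):
    response["cookies"]["csrftoken"] = request["csrf_token"]


class CsrfViewMiddleware:
    def process_response(self, request, response):
        # The former self._set_csrf_cookie(...) call now targets the function.
        _set_csrf_cookie(request, response)
        return response
```

The behaviour should be identical before and after; what the benchmark checks is whether the model can move the code and update every call site without breaking the file.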