AI leaderboards are no longer useful. It's time to switch to Pareto curves

(aisnakeoil.com)

40 points · jobbagy · 2y ago · 14 comments

8 comments · 4 top-level

ukuina · 2y ago · 4 in thread

This is the most applicable part of the article:

Strategies to improve LLM accuracy:

Retry: We repeatedly invoke a model with the temperature set to zero, up to five times, if it fails the test cases provided with the problem description. Retrying makes sense because LLMs aren’t deterministic even at temperature zero.

Warming: This is the same as the retry strategy, but we gradually increase the temperature of the underlying model with each run, from 0 to 0.5. This increases the stochasticity of the model and, we hope, increases the likelihood that at least one of the retries will succeed.

Escalation: We start with a cheap model (Llama-3 8B) and escalate to more expensive models (GPT-3.5, Llama-3 70B, GPT-4) if we encounter a test case failure.
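The three strategies quoted above can be sketched as small wrapper functions. This is only an illustration, not the article's code: `generate` and `passes_tests` are hypothetical callables standing in for a model call and the problem's test runner, the linear temperature schedule is an assumption (the quote only says the temperature rises gradually from 0 to 0.5), and running escalation at temperature 0 is also an assumption.

```python
from typing import Callable, Optional, Sequence

# Hypothetical: generate(model, prompt, temperature) -> candidate solution text.
Generate = Callable[[str, str, float], str]
# Hypothetical: passes_tests(candidate) -> True if the provided test cases pass.
PassesTests = Callable[[str], bool]


def retry(generate: Generate, passes_tests: PassesTests,
          model: str, prompt: str, attempts: int = 5) -> Optional[str]:
    """Retry: call the same model at temperature 0, up to `attempts` times."""
    for _ in range(attempts):
        candidate = generate(model, prompt, 0.0)
        if passes_tests(candidate):
            return candidate
    return None


def warming(generate: Generate, passes_tests: PassesTests,
            model: str, prompt: str, attempts: int = 5,
            max_temperature: float = 0.5) -> Optional[str]:
    """Warming: like retry, but raise the temperature on each attempt.

    Assumption: the temperature rises linearly from 0 to max_temperature.
    """
    for i in range(attempts):
        temperature = max_temperature * i / max(attempts - 1, 1)
        candidate = generate(model, prompt, temperature)
        if passes_tests(candidate):
            return candidate
    return None


def escalation(generate: Generate, passes_tests: PassesTests,
               models: Sequence[str], prompt: str) -> Optional[str]:
    """Escalation: try models from cheapest to most expensive.

    Assumption: each model is called once at temperature 0.
    """
    for model in models:
        candidate = generate(model, prompt, 0.0)
        if passes_tests(candidate):
            return candidate
    return None
```

All three return the first candidate that passes the tests, or `None` if every attempt fails, so the caller pays for the more expensive calls only after the cheaper ones have failed.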

vok · 2y ago

These strategies seem immediately practical. If you want to go beyond zero-shot for LLM coding, you may not need a complicated agent architecture - just start with escalation, retry, and warming.

smaddox · 2y ago

> Retrying makes sense because LLMs aren’t deterministic even at temperature zero.

This is news to me. I'm trying to think where non-determinism would come in at temperature zero, but coming up with nothing. What am I missing?

wongarsu · 2y ago

It can happen for a number of reasons, but in the case of GPT-4 it's probably because of their MoE implementation:

https://152334h.github.io/blog/non-determinism-in-gpt-4/

nicklecompte · 2y ago

It's because floating-point arithmetic isn't deterministic, which becomes salient when (speaking loosely) the difference between likelihood of two different tokens is less than the precision of the FPU.

I am not sure to what extent this effect has been quantified.
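A toy illustration of the floating-point point: each individual operation is reproducible, but addition is not associative, so the same numbers summed in a different order (as can happen in parallel reductions) may differ in the last bit. The sketch below uses made-up values and a hypothetical `greedy_pick` helper to show how that last bit can flip a near-tied greedy choice. It says nothing about how often this happens in a real model.

```python
def greedy_pick(logits):
    """Return the index of the largest value; ties go to the lowest index."""
    best = 0
    for i, value in enumerate(logits):
        if value > logits[best]:
            best = i
    return best


# The same three numbers, summed in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)  # False

# A rival token whose score sits right at the boundary (illustrative value).
rival = 0.6
pick_a = greedy_pick([rival, left_to_right])
pick_b = greedy_pick([rival, right_to_left])
print(pick_a, pick_b)  # the two summation orders select different tokens
```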

nicklecompte · 2y ago

I had a very similar comment last month - albeit more ignorant and less helpful: https://news.ycombinator.com/item?id=39957153

Basically, none of these agentic / MoE / etc papers have actually compared their results to the naive baseline: since these are nondeterministic programs, Randomized Algorithms 101 tells you that if the probability of success is sufficiently high, you can improve performance simply by running the algorithm multiple times and taking the majority/plurality result.

So is MoE or agents actually more effective than doing it the dumb way? AI Snake Oil says "no." Truly bizarre that dozens of researchers didn't even ask! It made me feel like I was missing something.
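To put rough numbers on the naive baseline described above, here is a small sketch under strong simplifying assumptions: each run is independent, each run is simply right or wrong, and a single run is correct with probability `p`. The function names are made up for illustration. `majority_success` covers the majority-vote case, and `any_success` covers the case where test cases let you recognize a correct answer as soon as one run produces it.

```python
from math import comb


def majority_success(p: float, n: int) -> float:
    """Probability that more than half of n independent runs are correct."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


def any_success(p: float, n: int) -> float:
    """Probability that at least one of n independent runs is correct."""
    return 1 - (1 - p) ** n


for n in (1, 5, 15):
    print(n, round(majority_success(0.6, n), 4), round(any_success(0.6, n), 4))
```

Majority voting only helps when `p` is above 0.5; below that, more runs make the majority answer worse. That is the sense in which the probability of success has to be "sufficiently high".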

sgt101 · 2y ago

I have advocated and used Pareto fronts as a model selection method for ML for a long while. It's really useful to construct two tests - hard but important, and run of the mill - and plot model performance against each one and draw a Pareto front so you can see which of your models are off the edge. In fact if you were to look at Figure 8.3 in "Managing Machine Learning Projects" then you would see this kind of thing!

But, I'm just an old bot shilling for my product.

wongarsu · 2y ago

Alternative title: AI leaderboards would be useful if they didn't blindly believe the author's benchmarks, included good baselines, and factored in real cost to run the model (parameter count can be misleading). Pareto curves are a good tool to decide which model is the best for a given price/performance tradeoff and should be used more

But that's not quite as catchy. Great article
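For readers who have not built one, a Pareto front over cost and accuracy takes only a few lines to compute. The sketch below is a generic illustration, not the article's code: the `Result` type and `pareto_front` function are hypothetical names, and it assumes lower cost and higher accuracy are both better.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    name: str
    cost: float      # lower is better
    accuracy: float  # higher is better


def pareto_front(results: List[Result]) -> List[Result]:
    """Keep results that no other result beats on both cost and accuracy."""
    front = []
    best_accuracy = float("-inf")
    # Walk from cheapest to most expensive; at equal cost, best accuracy first.
    for r in sorted(results, key=lambda r: (r.cost, -r.accuracy)):
        # A result is kept only if it beats every cheaper result on accuracy.
        if r.accuracy > best_accuracy:
            front.append(r)
            best_accuracy = r.accuracy
    return front
```

Anything not on the returned front is dominated: some other option is at least as cheap and at least as accurate. Choosing among the points on the front is then a matter of how much accuracy is worth paying for.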
