Strategies to improve LLM accuracy:
Retry: We repeatedly invoke a model with the temperature set to zero, up to five times, if it fails the test cases provided with the problem description. Retrying makes sense because LLMs aren’t deterministic even at temperature zero.
Warming: This is the same as the retry strategy, but we gradually increase the temperature of the underlying model with each run, from 0 to 0.5. This increases the stochasticity of the model and, we hope, increases the likelihood that at least one of the retries will succeed.
Escalation: We start with a cheap model (Llama-3 8B) and escalate to more expensive models (GPT-3.5, Llama-3 70B, GPT-4) if we encounter a test case failure.
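For concreteness, the three strategies above can be sketched as small loops. This is only a rough sketch, not the article's actual harness: `generate` and `passes_tests` are hypothetical callables standing in for a model call and the problem's test cases, the linear temperature schedule in `warming` is an assumption (the article only says the temperature rises gradually from 0 to 0.5), and so is running each model in `escalation` once at temperature 0.

```python
from typing import Callable, Optional, Sequence

# Hypothetical signatures (not from the article):
#   generate(model, prompt, temperature) -> candidate solution as a string
#   passes_tests(candidate) -> True if the candidate passes the provided test cases
Generate = Callable[[str, str, float], str]
Check = Callable[[str], bool]


def retry(generate: Generate, passes_tests: Check, model: str, prompt: str,
          max_attempts: int = 5) -> Optional[str]:
    """Re-invoke the same model at temperature 0 until the tests pass."""
    for _ in range(max_attempts):
        candidate = generate(model, prompt, 0.0)
        if passes_tests(candidate):
            return candidate
    return None  # every attempt failed


def warming(generate: Generate, passes_tests: Check, model: str, prompt: str,
            max_attempts: int = 5, max_temp: float = 0.5) -> Optional[str]:
    """Like retry, but raise the temperature from 0 to max_temp across attempts."""
    for attempt in range(max_attempts):
        # Assumed schedule: evenly spaced steps from 0 up to max_temp.
        temperature = max_temp * attempt / max(max_attempts - 1, 1)
        candidate = generate(model, prompt, temperature)
        if passes_tests(candidate):
            return candidate
    return None


def escalation(generate: Generate, passes_tests: Check, models: Sequence[str],
               prompt: str) -> Optional[str]:
    """Try models from cheapest to most expensive, stopping at the first pass."""
    for model in models:
        # Assumption: one attempt per model at temperature 0.
        candidate = generate(model, prompt, 0.0)
        if passes_tests(candidate):
            return candidate
    return None
```

With the default arguments, `warming` would try temperatures 0, 0.125, 0.25, 0.375 and 0.5 in that order.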
This is news to me. I'm trying to think where non-determinism would come in at temperature zero, but coming up with nothing. What am I missing?
I am not sure to what extent this effect has been quantified.
Basically, none of these agentic / MoE / etc. papers have actually compared their results to the naive baseline: since these are nondeterministic programs, Randomized Algorithms 101 tells you that if the probability of success is sufficiently high, you can improve performance simply by running the algorithm multiple times and taking the majority/plurality result.
So are MoE or agents actually more effective than doing it the dumb way? AI Snake Oil says "no." Truly bizarre that dozens of researchers didn't even ask! It made me feel like I was missing something.
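To put rough numbers on the "run it multiple times" baseline, here is a back-of-the-envelope sketch. It assumes each run is independent and succeeds with the same probability `p`, and that the answer is binary so a strict majority vote is well defined; real LLM samples are probably correlated, so treat the output as an optimistic bound rather than a measurement. The function names are made up for illustration.

```python
from math import comb


def majority_success(p: float, n: int) -> float:
    """Probability that a strict majority of n independent runs is correct.

    Assumes a binary outcome per run (correct / incorrect) with success
    probability p, and n odd so that ties cannot occur.
    """
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))


def any_success(p: float, n: int) -> float:
    """Probability that at least one of n independent runs is correct.

    This is the relevant quantity when a test suite can pick out the
    correct run, as in the retry strategy.
    """
    return 1 - (1 - p) ** n
```

For example, under these assumptions a model that is right 60% of the time per run gets to about 68% with a majority vote over 5 runs, and voting only helps when `p` is above 0.5; below that it makes things worse.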
But, I'm just an old bot shilling for my product.
But that's not quite as catchy. Great article