Read a study called "The Leaderboard Illusion" which credibly alleged that Meta, Google, OpenAI, and Amazon got preferential treatment from LM Arena that distorted the benchmarks
LM Arena gave them special access to test privately and let them benchmark over and over without disclosing the failed runs
Meta got to privately test Llama 4 27 times to optimize it for high benchmark scores, and was then allowed to report only the highest, cherry-picked score
Which makes sense, because in real-world applications Llama is widely recognized as markedly inferior to models that scored lower