The Mixtral grading model calculates the original starting votes, which users can further influence by voting on their preferred answers, affecting the leaderboard standings.
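As a rough illustration of how this could work (a minimal sketch, not the actual PVQ implementation; the names and the simple additive scoring are assumptions), the leaderboard might combine the grader's starting votes with user votes like this:

```python
from dataclasses import dataclass

@dataclass
class Answer:
    model: str
    grader_votes: int   # starting votes assigned by the Mixtral grading model
    user_votes: int = 0  # votes later cast by users on their preferred answers

def leaderboard(answers):
    """Rank answers by combined grader + user votes, highest first."""
    return sorted(answers, key=lambda a: a.grader_votes + a.user_votes, reverse=True)

# Hypothetical scores for a single question:
answers = [
    Answer("wizardlm-2", grader_votes=9, user_votes=3),
    Answer("mixtral-8x7b", grader_votes=6, user_votes=1),
    Answer("gemma-7b", grader_votes=4, user_votes=5),
]
for a in leaderboard(answers):
    print(a.model, a.grader_votes + a.user_votes)
```

So a grader's initial ranking can be overturned by enough user votes, as with `gemma-7b` overtaking `mixtral-8x7b` above.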
It should be noted that Mixtral 8x7B didn't grade its own model very highly, placing it 11th; its standout was grading Microsoft's WizardLM2 model highly at #2. This isn't entirely without merit, as at the time of its release it was Microsoft's most advanced model and the best open-source LLM available [1]. We also found it generated great, high-quality answers, so I'm surprised it's not more widely used: it's only OpenRouter's 15th most used model this month [2], although it received very little marketing, essentially just an announcement blog post.
Whilst nothing is perfect, we're happy with the grading system as it's still able to identify good answers from bad ones, good models from bad ones, and which topics models perform poorly on. Some of the grades are surprising since we hold preconceptions about where models should rank before the results are in, which is also why it's important to have multiple independent benchmarks, especially benchmarks that LLMs aren't optimized for, as I've often been disappointed by how some models perform in practice versus how well they perform in benchmarks.
Either way, you can inspect the different answers from each model yourself by paging through the popular questions [3]:
[1] https://wizardlm.github.io/WizardLM2/
[2] https://openrouter.ai/rankings?view=month
[3] https://pvq.app/questions