0 points · epups · 2y ago

> Keep in mind that modern quantitative approaches to LLM evaluation have been effectively co-designed with the rise of OpenAI, and folks like Ravenwolf routinely disagree with the leaderboards.

Sorry, but you're talking complete nonsense here. The LMSYS benchmark (Chatbot Arena) cannot be gamed, and Ravenwolf is a random-ass poster whose benchmarks have no scientific rigor.

ParetoOptimal · 2y ago

Cannot be gamed? C'mon now... You could pay a bunch of people to vote for your model in the arena.

epups (OP) · 2y ago

No you can't, because you actually don't know which model is which when you vote.

ParetoOptimal · 2y ago

Do only the initial votes count? After I made my initial choice, I was put in a session where I could see the names of both models, and I made subsequent votes in that session with the names visible.

epups (OP) · 2y ago

https://github.com/lm-sys/FastChat/issues/1210
