Better HN

0 points · briga · 4mo ago · 0 comments

With every big new model release, we see scores on benchmarks like ARC and Humanity's Last Exam climbing higher and higher. My question is: how do we know that these benchmarks are not part of the training set used for these models? A model could easily have been trained to memorize the answers. Even if the datasets haven't been copy-pasted directly, I'm sure they have leaked onto the internet to some extent.

But I am looking forward to trying it out. I find Gemini to be great at handling large-context tasks, and Google's inference costs seem to be among the cheapest.


stephc_int13 · 4mo ago

Even if the benchmarks themselves are kept secret, the process of creating them is not that difficult, and anyone with a small team of engineers could build a replica in their own lab to train their models on.

Given the nature of how those models work, you don't need exact replicas.
