I think the whole concept of standardized tests may need to be re-evaluated.
But would you have expected an algorithm to score 90th percentile on the LSAT two years ago? Our expectations of what an algorithm can do are being upended in real time. I think it's worth taking a moment to try to understand what the implications of these changes will be.
These LLMs are really exciting, but benchmarks like these exploit people’s misconceptions about both standardized tests and the technology.
> We tested GPT-4 on a diverse set of benchmarks, including simulating exams that were originally designed for humans. We did no specific training for these exams. A minority of the problems in the exams were seen by the model during training; for each exam we run a variant with these questions removed and report the lower score of the two. We believe the results to be representative. For further details on contamination (methodology and per-exam statistics), see Appendix C.
It's not the same as the Nvidia driver having code that says "if benchmark, cheat and don't render anything behind you because no one's looking".
I would say LLMs store parameters that are quite superficial and don’t really get at the underlying concepts, but given enough of those parameters, you can kind of cargo-cult your way to an approximation of understanding.
It is like reconstructing the Mandelbrot set at every zoom level with deep learning. Try it!
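To make the analogy concrete: the set at every zoom level is generated by one tiny rule, while a model fit to samples of it would store a huge pile of superficial parameters. A minimal sketch of that rule (function name and iteration cap are my own choices):

```python
# The standard escape-time membership test for the Mandelbrot set:
# c is in the set if z -> z^2 + c stays bounded (|z| <= 2) forever.
# max_iter is a practical cutoff, not part of the mathematical definition.
def in_mandelbrot(c: complex, max_iter: int = 100) -> bool:
    z = 0j
    for _ in range(max_iter):
        z = z * z + c
        if abs(z) > 2:
            return False  # escaped: definitely outside the set
    return True  # didn't escape within the cutoff: treat as inside

print(in_mandelbrot(0j))      # origin is inside
print(in_mandelbrot(2 + 2j))  # escapes on the first iteration
```

A few lines reproduce the structure at any zoom; a learned approximation would only ever interpolate the zoom levels it was trained on.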
> for each exam we run a variant with these questions removed and report the lower score of the two.
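The quoted methodology is simple to state as code. This is an illustrative sketch of the reported procedure, not OpenAI's actual implementation; the function names and toy data are mine:

```python
def score(answers, key):
    """Fraction of questions answered correctly."""
    return sum(a == k for a, k in zip(answers, key)) / len(key)

def reported_score(answers, key, contaminated_idx):
    """Score the full exam and a variant with contaminated
    questions removed; report the lower of the two."""
    full = score(answers, key)
    clean = [(a, k) for i, (a, k) in enumerate(zip(answers, key))
             if i not in contaminated_idx]
    clean_answers, clean_key = zip(*clean)
    return min(full, score(clean_answers, clean_key))

# Toy example: 4 questions, question index 2 was seen during training.
print(reported_score(["A", "B", "C", "D"],
                     ["A", "B", "C", "A"],
                     contaminated_idx={2}))
```

Note the asymmetry: removing contaminated questions can only lower the reported score when the model did relatively better on them, which is the conservative direction.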
I think even with all that test prep material, which is surely helping the model get a higher score, the high scores are still pretty impressive.
It's perfectly fine as a proxy for future earnings of a human.
To use it for admissions? Meh. I think the whole credentialism thing is loooong overdue for some transformation, but people are conservative as fuck.