I think the whole concept of standardized tests may need to be re-evaluated.
But would you have expected an algorithm to score 90th percentile on the LSAT two years ago? Our expectations of what an algorithm can do are being upended in real time. I think it's worth taking a moment to try to understand what the implications of these changes will be.
These LLMs are really exciting, but benchmarks like these exploit people’s misconceptions about both standardized tests and the technology.
> We tested GPT-4 on a diverse set of benchmarks, including simulating exams that were originally designed for humans. We did no specific training for these exams. A minority of the problems in the exams were seen by the model during training; for each exam we run a variant with these questions removed and report the lower score of the two. We believe the results to be representative. For further details on contamination (methodology and per-exam statistics), see Appendix C.
It's not the same as the Nvidia driver having code that says "if benchmark, cheat and don't render anything behind you because no one's looking".
I would say LLMs store parameters that are quite superficial and don’t really get at the underlying concepts, but given enough of those parameters, you can kind of cargo-cult your way to an approximation of understanding.
It is like reconstructing the Mandelbrot set at every zoom level with deep learning. Try it!
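To make the analogy concrete: the set at every zoom level is generated by one tiny rule, while a model fit to samples of it would store a huge pile of superficial parameters. A minimal sketch of that rule (function name and iteration cap are my own choices):

```python
# The standard escape-time membership test for the Mandelbrot set:
# c is in the set if z -> z^2 + c stays bounded (|z| <= 2) forever.
# max_iter is a practical cutoff, not part of the mathematical definition.
def in_mandelbrot(c: complex, max_iter: int = 100) -> bool:
    z = 0j
    for _ in range(max_iter):
        z = z * z + c
        if abs(z) > 2:
            return False  # escaped: definitely outside the set
    return True  # didn't escape within the cutoff: treat as inside

print(in_mandelbrot(0j))      # origin is inside
print(in_mandelbrot(2 + 2j))  # escapes on the first iteration
```

A few lines reproduce the structure at any zoom; a learned approximation would only ever interpolate the zoom levels it was trained on.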
> for each exam we run a variant with these questions removed and report the lower score of the two.
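The quoted methodology is simple to state as code. This is an illustrative sketch of the reported procedure, not OpenAI's actual implementation; the function names and toy data are mine:

```python
def score(answers, key):
    """Fraction of questions answered correctly."""
    return sum(a == k for a, k in zip(answers, key)) / len(key)

def reported_score(answers, key, contaminated_idx):
    """Score the full exam and a variant with contaminated
    questions removed; report the lower of the two."""
    full = score(answers, key)
    clean = [(a, k) for i, (a, k) in enumerate(zip(answers, key))
             if i not in contaminated_idx]
    clean_answers, clean_key = zip(*clean)
    return min(full, score(clean_answers, clean_key))

# Toy example: 4 questions, question index 2 was seen during training.
print(reported_score(["A", "B", "C", "D"],
                     ["A", "B", "C", "A"],
                     contaminated_idx={2}))
```

Note the asymmetry: removing contaminated questions can only lower the reported score when the model did relatively better on them, which is the conservative direction.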
I think even with all that test prep material, which is surely helping the model get a higher score, the high scores are still pretty impressive.
It's perfectly fine as a proxy for future earnings of a human.
To use it for admissions? Meh. I think the whole credentialism thing is loooong overdue for some transformation, but people are conservative as fuck.