> Its AMC-12 scores aren't awful.
A blank test scores 37.5 (25 questions at 1.5 points per blank).
The best reported score, 60, is 5 correct answers + 20 blanks; or equivalently 6 correct, 4 correct random guesses, and 15 incorrect random guesses (each guess has a 20% chance of being correct, since there are 5 choices).
The 5 easiest questions are relatively simple calculations once the problem statement is parsed.
(Example: https://artofproblemsolving.com/wiki/index.php/2022_AMC_12A_... )
So the main factor in that score is how good GPT is at declining to answer questions it can't solve, or at guessing well enough to overcome the scoring disadvantage of wrong answers relative to blanks.
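The scoring arithmetic above can be sanity-checked with a quick sketch (assuming the current AMC rule: 6 points per correct answer, 1.5 per blank, 0 per wrong answer):

```python
# Sketch of the AMC scoring rule: 25 questions,
# 6 points per correct answer, 1.5 per blank, 0 per wrong answer.
def amc_score(correct: int, blank: int, wrong: int) -> float:
    assert correct + blank + wrong == 25
    return 6 * correct + 1.5 * blank

print(amc_score(0, 25, 0))   # blank test: 37.5
print(amc_score(5, 20, 0))   # 5 correct + 20 blank: 60.0
print(amc_score(10, 0, 15))  # 6 correct + 4 lucky guesses + 15 wrong: 60.0

# A random guess among 5 choices is worth 6 / 5 = 1.2 points on average,
# less than the 1.5 points for leaving the question blank.
```

So on average, random guessing strictly loses to leaving questions blank, which is why a score below 37.5 is notable.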
> Its AMC 10 score being dramatically lower is pretty bad though...
Both versions (scoring 30 and 36) did worse than leaving the test blank (37.5).
The only explanation I can imagine for that is that it can't understand diagrams.
It's also unclear whether the AMC performance is based on the English-language problems or the computer-encoded versions from this benchmark set: https://arxiv.org/pdf/2109.00110.pdf https://openai.com/research/formal-math
AMC/AIME and even, to some extent, USAMO/IMO problems are hard for humans because they are time-limited and closed-book. But they aren't conceptually hard -- they are solved by applying a subset of a known set of theorems a few times to the input data.
The hard part of math, for humans, is ingesting data into their brains, retaining it, and searching it. Humans are bad at memorizing large databases of symbolic data, but that's trivial for a large computer system.
An AI system has a comprehensive library and high-speed search algorithms.
Can someone who pays $20/month please post some sample AMC10/AMC12 Q&A?