- maybe ALL humans would fail the test in some way; e.g. suppose everybody gets at least 10 of those wrong, and the average person gets 100 wrong.
- still, as long as most people get each word right, your LLM would get every single response correct (because for each item in the test, 900+ people out of a thousand gave the same correct answer in the training set).
In that sense, it's totally possible for a system trained on a vast vat of average-human input to generate super-human outputs.
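This wisdom-of-crowds effect is easy to simulate. The sketch below (an illustration, not a model of any actual training process) gives every simulated person a 10% per-item error rate, then takes the majority answer per item as a stand-in for what training on the pooled data would favor:

```python
import random
from collections import Counter

random.seed(0)

NUM_PEOPLE = 1000
NUM_ITEMS = 1000
ERROR_RATE = 0.10  # each person gets ~10% of items wrong

# Each item's correct answer is "A"; a mistake records "B" instead.
def answer_sheet():
    return ["B" if random.random() < ERROR_RATE else "A"
            for _ in range(NUM_ITEMS)]

sheets = [answer_sheet() for _ in range(NUM_PEOPLE)]

# Majority vote per item across all thousand people.
majority = [Counter(sheet[i] for sheet in sheets).most_common(1)[0][0]
            for i in range(NUM_ITEMS)]

# Every individual errs somewhere, yet the majority answer
# is right on every item.
per_person_errors = [sum(a != "A" for a in s) for s in sheets]
majority_errors = sum(a != "A" for a in majority)
print(min(per_person_errors), majority_errors)
```

Even the single best simulated person still gets dozens of items wrong, while the majority answer scores a perfect sheet: with 1000 voters and a 10% error rate, the chance that the wrong answer wins any given item is vanishingly small.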