You don't get exactly the same test in the end, much like with the SAT, but the constraints we put on these tests (they have to be comparable) produce patterns in the questions that you can train for. This is the same reason people can train to improve their SAT scores: if the scores measured true innate intelligence, training would have no impact on them.
I would presume not. Most tests are timed, and if you are spending time on first-time-only tasks just to understand the problem, then the result is inaccurate. If you train out those first-time tasks so that you are repeatably using the time budget in the test to solve problems, then you should reach some kind of steady state and produce repeatable, more accurate test scores.
My take is that the repeatable scores measuring your steady state in the task would be more accurate than the untrained scores with an unknown amount of initialization time within each problem. I would make a similar claim to naasking below that this could account for some of the Flynn effect.
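To make the time-budget argument concrete, here is a rough sketch of a toy model. Everything in it is an assumption for illustration: the function name `problems_solved`, the 60-minute budget, and the per-problem times are made up, and real tests are obviously messier than a fixed cost per problem.

```python
def problems_solved(time_budget_min, solve_min, warmup_min=0.0):
    """Count problems finished within the budget.

    Each problem costs its solve time plus a hypothetical "first-time"
    warm-up overhead spent just understanding the question format.
    """
    per_problem = solve_min + warmup_min
    return int(time_budget_min // per_problem)


# Hypothetical numbers: same solving ability, different warm-up overhead.
untrained = problems_solved(60, solve_min=2.0, warmup_min=1.0)
trained = problems_solved(60, solve_min=2.0, warmup_min=0.0)

print(untrained, trained)
```

In this toy model the underlying solve time is identical in both runs; only the warm-up overhead changes, which is the sense in which the trained score would be the cleaner measurement.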
Not every person takes an SAT prep class to improve their test score. There are lots of people who are truly above average in intelligence and can score very high on the first try.
So isn't this literally moving the goalposts? "So what if an AI can beat the SAT? So can humans."
Prompt:
I'm talking with a user named "331c8c71" on Hacker News. According to statistics I found by Googling, the number of Hacker News readers living in the USA is 36%. Assuming this is correct, what is the probability that 331c8c71 does not reside in the USA? Please explain your reasoning.
Response:
If we assume that the statistic you found is accurate, then we know that 36% of Hacker News readers live in the USA. Therefore, the probability that any given Hacker News reader does not live in the USA is 100% - 36% = 64%.
However, we do not know if 331c8c71 is a typical Hacker News reader. They could be part of a demographic that is more or less likely to live in the USA. Without more information, we cannot say for certain what the probability is that 331c8c71 does not reside in the USA.
So, the answer is 64% assuming that 331c8c71 is a typical Hacker News reader, but it may not be an accurate estimate depending on additional information about the user.
---
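The arithmetic in the response above is easy to check, and the "typical reader" caveat can be sketched as a Bayes update. The 36% figure comes from the prompt; the evidence and its two likelihoods below are purely hypothetical numbers chosen for illustration, not anything known about the user.

```python
import math

# From the prompt: share of Hacker News readers living in the USA.
p_usa = 0.36
p_not_usa = 1.0 - p_usa  # complement rule

# Hypothetical extra evidence E (for example, the hours the user posts at).
# These two likelihoods are invented for illustration only.
p_e_given_not_usa = 0.5
p_e_given_usa = 0.2

# Bayes' rule: P(not USA | E)
numerator = p_e_given_not_usa * p_not_usa
posterior_not_usa = numerator / (numerator + p_e_given_usa * p_usa)

print(round(p_not_usa, 2), round(posterior_not_usa, 3))
```

With no extra information the answer stays at the base rate, which is what the response says; any evidence that is more likely for non-US readers pushes the estimate above that.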
Like I said, these models are far from 100% accurate, and there is a lot they get wrong, but they clearly are capable of some kind of reasoning that goes beyond simple text substitution of training data.
In particular, LLMs fail miserably at tasks like "apply this simple pattern many times in succession", a.k.a. a "for-loop", because they can't count in an abstract way, only in concrete contexts.
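For a concrete idea of what a "for-loop" task looks like, here is a minimal sketch. The rotation rule and the names `apply_n_times` and `rotate_left` are made up for illustration and are not taken from any benchmark.

```python
def apply_n_times(step, value, n):
    """Apply the same one-step rule n times in succession."""
    for _ in range(n):
        value = step(value)
    return value


def rotate_left(s):
    """Hypothetical simple pattern: move the first character to the end."""
    return s[1:] + s[:1]


result = apply_n_times(rotate_left, "abcdef", 4)
print(result)
```

Each step is trivial on its own; the only hard part is keeping an exact count of how many times the rule has been applied, which is the abstract counting the comment is talking about.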