That’s to prevent overfitting on their own dataset; it does nothing to prevent overfitting on the benchmark test data, which has likely leaked into their dataset anyway.
You basically cannot beat GPT-4 on the broad reasoning tasks these benchmarks are designed to cover without some of the test material leaking into the training dataset. There simply aren’t enough parameters, nor enough training, to make that possible otherwise.