They're also trained on random data scraped off the Internet, which might include the benchmarks themselves, code that resembles them, and AI articles with things like chain-of-thought examples. There's been some effort to filter out obvious benchmark copies, but is that enough? I can't tell whether the AIs are actually getting smarter or whether more cheat sheets are ending up in the training data.
Just brainstorming, but one idea: train them on datasets from before the benchmarks (or much AI-generated material) existed, and keep testing algorithmic improvements against that baseline in addition to models trained on up-to-date data. That might give a more accurate assessment.
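The core of the idea is just a date cutoff on the training corpus. Here's a minimal sketch, assuming each document carries a crawl or publication date (the corpus records and the specific cutoff date are made up for illustration):

```python
from datetime import date

# Hypothetical corpus records: (text, crawl/publication date).
corpus = [
    ("Old encyclopedia article ...", date(2017, 3, 1)),
    ("Blog post walking through benchmark problems ...", date(2022, 6, 15)),
    ("AI article with chain-of-thought examples ...", date(2023, 1, 10)),
]

# Cutoff: keep only documents from before the benchmark era and
# widespread AI-generated text (the exact date here is arbitrary).
CUTOFF = date(2021, 1, 1)

clean_corpus = [text for text, d in corpus if d < CUTOFF]

# Train one model on clean_corpus and another on the full corpus,
# then apply the same algorithmic change to both; a gain that shows
# up only in the full-corpus model suggests contamination rather
# than a real improvement.
```

The obvious weakness is that crawl dates can lie (old URLs get rewritten with new content), so in practice you'd want archived snapshots rather than trusting timestamps.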