With public benchmarks we're trusting the labs not to cheat. And it's easy to "cheat" accidentally - they actually need to make a serious effort to not contaminate the training data.
And there's massive incentives for the labs to cheat in order to get the hype going around their launch and justify their massive investments in training. It doesn't have to be the CEO who's directing it. Can even be one/a few researchers who are responsible for a specific area of model performance and are under tremendous pressure to deliver.