We won't know if it's spoiled, or rather how spoiled it is, unless the companies release their training data.
But in this case we can study it a different way: use things we are certain are spoiled. That's what the author does here.
But as an ML researcher, I'll let you know that I don't trust a single reasoning paper I've read.
You either have to start with the premise that the thing you're testing is in the training data (and thus spoiled), in which case you look at generalization and how robust the behavior is. You can't prove reasoning this way, but you can disprove it. This also works for theory of mind (though it seems many HN readers failed to read the first paragraph).
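Here's roughly what I mean by that first approach, as a hedged sketch rather than anything from the article: perturb the surface details of an item you assume the model has memorized and see whether performance survives. `query_model` is just a stand-in for whatever LLM API you'd actually call, and the word-problem template is made up for illustration.

```python
import random
from typing import Callable

def robustness_check(query_model: Callable[[str], str], n_trials: int = 50) -> float:
    """Re-ask a (presumably memorized) benchmark item with surface details
    changed, and measure whether accuracy survives the perturbation."""
    rng = random.Random(0)
    correct = 0
    for _ in range(n_trials):
        # Same reasoning step as the original item, fresh numbers each time.
        a, b = rng.randint(10, 99), rng.randint(10, 99)
        prompt = f"Alice has {a} apples and buys {b} more. How many does she have?"
        if str(a + b) in query_model(prompt):
            correct += 1
    # High accuracy on the original item but a collapse here is evidence
    # against reasoning: memorization alone explains the original score.
    return correct / n_trials
```

If the model aces the canonical version but falls apart under trivial renaming or renumbering, you've disproven the reasoning claim for that item; if it holds up, you've shown robustness, not reasoning.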
The other way is to prove that the data isn't in the training set (for the strong version, that it isn't even indirectly in the data...). You still can't prove reasoning this way, but you can build strong evidence that it's going on (proving reasoning is very hard, if it's possible at all). I think if this were shown consistently, most of the conversations about LLMs not reasoning would go away and we'd discuss them the way we discuss humans: capable of reasoning, but not necessarily always doing it.
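For a sense of why even that direction is hard to make airtight, here's a toy contamination check, assuming you had access to the corpus at all (the 13-gram window is just a common choice, not what any particular lab does):

```python
from typing import Iterable, Set, Tuple

def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """All n-grams of whitespace tokens in a lowercased text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(eval_item: str, corpus_docs: Iterable[str], n: int = 13) -> bool:
    """Flag the eval item if any of its n-grams appears verbatim in any corpus doc."""
    item_grams = ngrams(eval_item, n)
    return any(item_grams & ngrams(doc, n) for doc in corpus_docs)
```

Note that this only catches verbatim or near-verbatim overlap; paraphrases, translations, and derived discussions of the problem (the "indirectly in the data" part) slip straight through, which is why the strong condition is so hard to establish.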
But ML is in an existential crisis right now: theory means nothing without experimentation, and experimentation means nothing without theory. See von Neumann's elephant.