Just check out the original UT paper, or some of it's follow ups: Neural Data Router, https://arxiv.org/abs/2110.07732; Sparse Universal Transformers (SUT), https://arxiv.org/abs/2310.07096.
There is even theoretical justification for why: https://arxiv.org/abs/2503.03961
The challenge is actually scaling them up to be useful as LLMs as well (I describe why it's a challenge in the SUT paper).
It's hard to say with the way ARC-AGI is allowed to be evaluated if this is actually what is at play. My gut tells me, given the type of data that's been allowed in the training set, that some leakage of the evaluation has happened in both HRM and TRM.
But because as a field we've given up on actually carefully ensuring training and test don't contaminate, we just decide it's fine and the effect is minimal. Especially considering LLMs, the test set example leaking into the dataset is merely a drop in the bucket (I don't believe we should be dismissing it this way, but that's a whole 'nother conversation).
With these models that are challenge-targeted, it becomes a much larger proportion of what influences the model behaviour, especially if the open evaluation sets are there for everyone to look at and simply generate more. Now we don't know if we're generalising or memorising.