That attention heads are mandatory for transformers is a tautology (without them a transformer is just an MLP…), so of course that statement is going to be correct, by definition.
But when you move the goalposts to land on a tautology, you've surrendered your ability to argue anything and you're just ridiculing yourself. Take this question of yours, for instance:
> If you know of any models that have had success (even at the GPT-2 level) without Self-Attention, I'd be interested to know what they are, because I don't know of any.
That is a legitimate, non-ridiculous question.
But if you replace it with your later, much weaker argument:
> > If you know of any MLPs that have had success (even at the GPT-2 level), I'd be interested to know what they are, because I don't know of any.
Then it becomes a dumb question, given that MLPs have no mechanism for encoding context and can't process sequences of words in the first place.
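To make that point concrete, here's a minimal sketch (toy dimensions, random weights, PyTorch purely for illustration, nothing here is anyone's actual model): a per-token MLP's output for token i depends only on token i's embedding, while self-attention lets every output depend on the whole sequence.

```python
import torch

torch.manual_seed(0)
seq_len, d_model = 4, 8
x = torch.randn(seq_len, d_model)  # one toy "sentence" of 4 token embeddings

# Per-token MLP: applied row by row, no communication between positions.
mlp = torch.nn.Sequential(
    torch.nn.Linear(d_model, d_model),
    torch.nn.ReLU(),
    torch.nn.Linear(d_model, d_model),
)

# Single-head self-attention: every output row is a weighted mix of all rows.
Wq, Wk, Wv = (torch.randn(d_model, d_model) for _ in range(3))

def self_attention(x):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = torch.softmax(q @ k.T / d_model**0.5, dim=-1)
    return scores @ v

# Perturb token 0 only, then check whether the other tokens' outputs move.
x2 = x.clone()
x2[0] += 1.0

print("MLP outputs for tokens 1..3 changed?",
      not torch.allclose(mlp(x)[1:], mlp(x2)[1:]))        # False: no context flows
print("Attention outputs for tokens 1..3 changed?",
      not torch.allclose(self_attention(x)[1:], self_attention(x2)[1:]))  # True
```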
So when you claim that this was your argument all along, it's particularly embarrassing: you're effectively saying your earlier arguments were just as dumb, when they weren't.
That's why I said you're disrespecting your own earlier arguments by retreating to your later tautology.