A CPU is a finite state machine, so attaching an unbounded tape trivially makes it Turing complete (TC) in theory.
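To make the "finite state machine plus unbounded tape" construction concrete, here is a minimal sketch in Python. The function name `run_turing_machine`, the `FLIP_BITS` table, and the `max_steps` guard are all illustrative assumptions, not anything from the paper; the tape is a dictionary that grows on demand, which is the closest a real machine gets to "unbounded".

```python
from collections import defaultdict


def run_turing_machine(transitions, tape_input, start, accept,
                       blank="_", max_steps=10_000):
    """Drive a finite-state controller over a tape that grows on demand.

    transitions maps (state, symbol) -> (next_state, symbol_to_write, move),
    where move is "L" or "R". All names here are hypothetical.
    """
    # Unvisited cells read as blank; real memory is still finite, of course.
    tape = defaultdict(lambda: blank, enumerate(tape_input))
    state, head, steps = start, 0, 0
    while state != accept:
        if steps >= max_steps:
            raise RuntimeError("step limit reached without accepting")
        key = (state, tape[head])
        if key not in transitions:
            raise ValueError(f"no transition defined for {key}")
        state, write, move = transitions[key]
        tape[head] = write
        head += 1 if move == "R" else -1
        steps += 1
    # Read the tape back in cell order and drop surrounding blanks.
    return "".join(tape[i] for i in sorted(tape)).strip(blank)


# A toy controller with two states: it inverts every bit, then halts.
FLIP_BITS = {
    ("scan", "0"): ("scan", "1", "R"),
    ("scan", "1"): ("scan", "0", "R"),
    ("scan", "_"): ("done", "_", "R"),
}

print(run_turing_machine(FLIP_BITS, "1011", start="scan", accept="done"))
```

The controller itself never changes size; all the unboundedness lives in the tape, which is exactly the part a physical machine cannot supply.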
The arbitrary-precision requirements on the activation function and on positions are there to keep attention's dynamic reweighting values in the computable set.
Even multi-layer neural networks build their curves by shifting, reflecting, and summing line segments, and with typical activation functions the results of those operations may not map to representable numbers, even given unbounded digits.
Using an activation function that keeps results within a set of cardinality aleph-nought, that is, a countably infinite set, is what allows it to be TC.
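One way to read the point about staying in a countable set, sketched in Python under my own assumptions (the weights, inputs, and the `relu_neuron` helper are made up for illustration): with rational weights and inputs, a ReLU neuron only adds, multiplies, and clips, so its output is again an exact rational. A sigmoid needs `exp`, and e^x is irrational for any nonzero rational x, so a finite representation can only approximate it.

```python
from fractions import Fraction
import math


def relu(x):
    """ReLU on exact rationals: the result is still a Fraction."""
    return x if x > 0 else Fraction(0)


def relu_neuron(weights, inputs, bias):
    """Weighted sum plus bias, then ReLU, all in exact rational arithmetic."""
    pre_activation = sum((w * x for w, x in zip(weights, inputs)), bias)
    return relu(pre_activation)


# Hypothetical rational parameters, chosen only for the demonstration.
weights = [Fraction(1, 3), Fraction(-2, 7)]
inputs = [Fraction(5, 2), Fraction(1, 4)]
bias = Fraction(1, 10)

exact_output = relu_neuron(weights, inputs, bias)
print(exact_output)  # an exact ratio of two integers

# The sigmoid of the same value has no exact rational form,
# so this is only a floating-point approximation.
sigmoid_approx = 1.0 / (1.0 + math.exp(-float(exact_output)))
print(sigmoid_approx)
```

The rationals are countable, so the ReLU path never leaves a countable set; the sigmoid path leaves the rationals on the first call.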
Probably Approximately Correct (PAC) learning is intentionally fuzzy.
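The fuzziness is built into the definition: PAC only asks for error at most epsilon with probability at least 1 - delta. As a rough illustration, here is the textbook sample bound for a finite hypothesis class in the realizable case, m >= (ln|H| + ln(1/delta)) / epsilon. The function name and the example numbers are my own assumptions, and the bound does not apply directly to neural networks.

```python
import math


def pac_sample_bound(hypothesis_count, epsilon, delta):
    """Samples sufficient for a consistent learner over a finite class.

    Realizable-case bound: m >= (ln|H| + ln(1/delta)) / epsilon.
    epsilon is the tolerated error, delta the tolerated failure probability.
    """
    if hypothesis_count < 1 or not (0 < epsilon < 1) or not (0 < delta < 1):
        raise ValueError("need |H| >= 1 and epsilon, delta in (0, 1)")
    needed = (math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon
    return math.ceil(needed)


# Hypothetical example: 2**10 hypotheses, 10% error, 95% confidence.
samples = pac_sample_bound(2 ** 10, epsilon=0.1, delta=0.05)
print(samples)
```

Neither epsilon nor delta can be set to zero, which is the sense in which the guarantee is "probably" and "approximately" rather than exact.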
The occasional gradient loss problem with ReLU is possibly a useful lens for thinking about this.
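I am assuming the gradient loss meant here is what is usually called the "dying ReLU" problem. A minimal sketch in Python, with made-up inputs and a hypothetical `weight_gradient` helper: once the pre-activation is negative for every input, the ReLU derivative is zero everywhere, so gradient descent has nothing to update the weight with.

```python
def relu_grad(pre_activation):
    """Derivative of ReLU: 1 on the positive side, 0 otherwise."""
    return 1.0 if pre_activation > 0 else 0.0


def weight_gradient(weight, bias, xs, upstream=1.0):
    """Mean dL/dw for y = relu(weight * x + bias).

    `upstream` stands in for dL/dy and is held constant for simplicity.
    """
    total = sum(upstream * relu_grad(weight * x + bias) * x for x in xs)
    return total / len(xs)


xs = [0.1, 0.5, 0.9]  # illustrative inputs only

# A large negative bias pushes every pre-activation below zero.
dead = weight_gradient(weight=1.0, bias=-5.0, xs=xs)
# With no bias, every pre-activation is positive and gradient flows.
alive = weight_gradient(weight=1.0, bias=0.0, xs=xs)

print(dead, alive)
```

The dead neuron is a case where the idealized function class contains a fix, but the training procedure cannot reach it.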
But the success of statistical learning over the past 30 years has largely come from settling for existential quantifiers with acceptable training loss, following the very useful idea from statistics that all models are wrong but some are useful.
Transformer models will most definitely be useful for some problems, but assuming that a TC result proven for a physically unrealizable configuration will hold in practice will lead to wasted effort.
Simply acknowledging the potential dead ends of a technology helps not only with choosing the right path but also with recognizing early that you need to change course.
IMHO, the method in this post's paper is far more useful as a lens for building intuition.