Yes some operations are not supported in MPS/TPU and falls back to slower CPU. But for common architectures like transformers and convnets, it works very well for all the datasets.
I never claimed it was easy. I meant in my opinion it is in the order of 10s of millions dollars of investment, not a trillion dollar CUDA moat that people comment here.