Transformers work well on sequence tasks because they both compare well on accuracy and scale better than RNNs such as LSTMs or GRUs. That means they can be trained on more data.
This isn't really the same as CNNs, which model images by operating at different scales. I'm not aware of any cases of Transformers being used particularly successfully on images.
They can be used on graphs, of course, by translating the problem into a graph-walk problem (à la DeepWalk).
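To make the graph-walk idea concrete, here is a minimal sketch of how a graph can be turned into sequences with uniform random walks, which is the general idea behind DeepWalk. The toy graph, the function name `random_walks`, and its parameters are made up for illustration; this is not DeepWalk's actual code. Each walk is just a list of node IDs, so it can be treated like a sentence of tokens and fed to any sequence model.

```python
import random

def random_walks(adjacency, walk_length, walks_per_node, seed=0):
    """Generate uniform random walks starting from every node.

    `adjacency` maps each node to a list of its neighbours.
    """
    rng = random.Random(seed)
    walks = []
    for start in adjacency:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbours = adjacency[walk[-1]]
                if not neighbours:  # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks

# Hypothetical toy graph as an adjacency list (undirected, so edges appear twice).
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}

walks = random_walks(graph, walk_length=5, walks_per_node=2)
for walk in walks:
    print(" ".join(walk))
```

The walk length and number of walks per node are arbitrary here; in practice they are hyperparameters you would tune.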
All the examples you gave (language modelling, Dota 2, music and protein modelling) are set up as sequence prediction problems, so they are a natural fit for Transformers.
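As a rough illustration of what "set up as a sequence prediction problem" means, the sketch below turns one sequence into (context, next item) training pairs. The helper `next_token_pairs`, the context size, and the note sequence are hypothetical; the same framing applies whether the items are words, game actions, notes, or amino acids.

```python
def next_token_pairs(tokens, context_size):
    """Build (context, target) pairs for next-item prediction.

    The context is the (at most) `context_size` items before the target.
    """
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_size):i]
        pairs.append((context, tokens[i]))
    return pairs

# Hypothetical example: a short melody as a sequence of note names.
notes = ["C4", "E4", "G4", "E4", "C4"]
pairs = next_token_pairs(notes, context_size=3)
for context, target in pairs:
    print(context, "->", target)
```

A model trained on such pairs learns to predict the next item from what came before, regardless of the domain the items come from.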