
nl · 7y ago · 0 points · 0 comments

It's not really as clear cut as that.

Transformers work well on sequence tasks because they compare well on accuracy and also scale better than RNNs such as LSTMs or GRUs. That means they can be trained on more data.
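To make the scaling point concrete, here is a rough sketch in plain Python (toy scalar inputs, no learned weights; the function names and the `w_in`/`w_rec` values are made up for illustration). The recurrent update has to visit positions one after another because each state depends on the previous one, whereas each self-attention output is computed from the whole input independently, so in principle all positions can be computed in parallel.

```python
import math


def rnn_states(xs, w_in=0.5, w_rec=0.5):
    """Toy scalar RNN: state t depends on state t-1, so the loop is sequential."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(xs):
    """Toy scalar self-attention with queries = keys = values = xs.

    Assumption: no learned projections, and the usual 1/sqrt(d) scaling
    is 1 because the dimension d is 1.
    """
    def attend(q):
        weights = softmax([q * k for k in xs])
        return sum(w * v for w, v in zip(weights, xs))

    # Each call to attend() is independent of the others, unlike the RNN loop.
    return [attend(q) for q in xs]
```

This leaves out everything that makes a real Transformer work (projections, multiple heads, positional information, feed-forward layers); it only illustrates the dependency structure.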

This isn't really the same as CNNs, which model images by operating at multiple scales. I'm not aware of any cases of Transformers being used particularly successfully on images.

They can be used on graphs of course, by translating the problem into a graph walk problem (à la DeepWalk).
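As a rough sketch of what I mean by a graph walk: DeepWalk samples truncated uniform random walks from every node and treats each walk as a "sentence" of node IDs. The snippet below only does that sampling step; the graph, the function name and the parameter defaults are made up for illustration. DeepWalk itself then trains a skip-gram model on the walks, so feeding them to a Transformer instead is my assumption, not something the DeepWalk paper does.

```python
import random


def random_walks(adjacency, walk_length=5, walks_per_node=2, seed=0):
    """Sample truncated uniform random walks; each walk is a list of node IDs."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        nodes = list(adjacency)
        rng.shuffle(nodes)  # visit start nodes in a fresh random order each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbours = adjacency[walk[-1]]
                if not neighbours:  # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks


# Hypothetical toy graph as an adjacency list.
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
walks = random_walks(graph)
```

Each element of `walks` is then just a token sequence, which is the form any sequence model expects.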

All the examples you gave (language modelling, Dota2, music and protein modelling) are set up as sequence prediction problems, so are perfect for Transformers.


2 comments · 1 top-level

p1esk · 7y ago · 1 in thread

https://arxiv.org/abs/1904.09925

nl (OP) · 7y ago

Nice. I guess I'm 8 days behind on the SOTA...

But I'd note that it is built on top of a CNN base (ResNet or RetinaNet) and that the Attention-only system performed slightly worse than the one including the CNN layers.

Also, this isn't really a Transformer architecture, even though it uses Attention.

But maybe this is too much nitpicking? I agree that Attention is a useful primitive - my point is that the Transformer architecture is too specific.

(Also, this is a really nice paper in that it lays out the hyperparameters and training schedules they used. And that Appendix is amazing!)
