
VHRanger · 2y ago · 0 points

Yeah, that's just the training stability part to my knowledge



whimsicalism · 2y ago

They're also just less capable models. Just adding attention on top of an RNN made them a lot better.

Calculating self-attention is still quadratic in sequence length though. So if you bolt attention onto an RNN, you get that downside of transformers there too.
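
To make the quadratic point concrete, here's a rough sketch of naive single-head self-attention in plain Python that counts how many query/key scores it computes. It assumes identity Q/K/V projections to keep it short, and `self_attention` and `make_tokens` are names made up for this example, not anything from a real library.

```python
import math


def self_attention(x):
    """Naive single-head self-attention.

    Assumption for brevity: queries, keys and values are the inputs
    themselves (identity projections).

    x: list of n token vectors, each a list of d floats.
    Returns (outputs, number of pairwise scores computed).
    """
    n = len(x)
    d = len(x[0])
    scale = 1.0 / math.sqrt(d)
    outputs = []
    pair_count = 0
    for i in range(n):
        # One score per (query i, key j) pair: n per token, n * n overall.
        scores = []
        for j in range(n):
            scores.append(scale * sum(a * b for a, b in zip(x[i], x[j])))
            pair_count += 1
        # Numerically stable softmax over the n scores for this query.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Output i is the attention-weighted sum of the value vectors.
        outputs.append(
            [sum(w * x[j][k] for j, w in enumerate(weights)) for k in range(d)]
        )
    return outputs, pair_count


def make_tokens(n, d=4):
    # Deterministic toy inputs (hypothetical helper, illustration only).
    return [[math.sin(i + k) for k in range(d)] for i in range(n)]


if __name__ == "__main__":
    for n in (8, 16, 32):
        _, pairs = self_attention(make_tokens(n))
        print(n, pairs)  # prints: 8 64, then 16 256, then 32 1024
```

Doubling the sequence length quadruples the number of scores, which is the cost being referred to. A plain RNN step only touches the previous hidden state, so it stays linear in sequence length.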
