VHRanger
2y ago
Yeah, that's just the training stability part, to my knowledge.
whimsicalism
2y ago
they're also just less capable models. like just adding attention on top of an RNN made them a lot better
SpaceManNabs
2y ago
Calculating self-attention is still quadratic in sequence length, though. So you get the negatives of transformers there too.
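To make the quadratic part concrete, here is a minimal pure-Python sketch of naive single-head self-attention. It is an illustration only, not anything from the thread: `self_attention` and `toy_sequence` are hypothetical helper names, and it assumes queries, keys, and values are all the raw inputs (no learned projections). The point is just that every token scores against every other token, so the number of scores grows as n².

```python
import math

def self_attention(x):
    """Naive single-head self-attention over a list of vectors.

    Assumption for illustration: queries, keys, and values are all x itself
    (no learned projections), which is enough to show the cost.
    Returns the output vectors and the number of pairwise scores computed.
    """
    n, d = len(x), len(x[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    score_count = 0
    for q in x:
        # One score per (query, key) pair: n scores for each of the n queries.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in x]
        score_count += len(scores)
        # Numerically stable softmax over this query's row of scores.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out, score_count

def toy_sequence(n, d=4):
    """Hypothetical toy input: n deterministic vectors of dimension d."""
    return [[math.sin(i + j) for j in range(d)] for i in range(n)]

if __name__ == "__main__":
    for n in (8, 16, 32):
        _, count = self_attention(toy_sequence(n))
        print(n, count)  # prints: 8 64, then 16 256, then 32 1024
```

Doubling the sequence length quadruples the number of scores, which is the cost being pointed at here. Real implementations batch this as matrix multiplies, but the n × n score matrix is still there.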