
lucidrains · 2y ago · 0 points · 0 comments

could even try it with a fraction of the attention heads, instead of introducing new tokens



sdenton4 · 2y ago · 1 in thread

An important piece here is that there's still a training signal making it to the makes weights. See SimSiam for a similar example.

indeed, SimSiam is a great example of the effectiveness of using stop-gradient
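
To make the stop-gradient point concrete, here is a minimal sketch in plain Python. It is not SimSiam itself: it assumes a scalar "encoder" weight `w`, a scalar "predictor" weight `h`, and a squared-error loss in place of SimSiam's negative cosine similarity. The `Value` class, `stop_gradient`, and `encoder_grad` are hypothetical names made up for this illustration. The point it tries to show is that the encoder weight still receives a gradient through the predictor branch even when the target branch is detached.

```python
class Value:
    """Toy scalar with reverse-mode autodiff (illustration only)."""

    def __init__(self, data, parents=(), grad_fns=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents      # inputs this value was computed from
        self._grad_fns = grad_fns    # one local-derivative function per parent

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (lambda g: g * other.data, lambda g: g * self.data))

    def __sub__(self, other):
        return Value(self.data - other.data, (self, other),
                     (lambda g: g, lambda g: -g))

    def backward(self):
        # Build a topological order, then push gradients from output to inputs.
        order, seen = [], set()

        def visit(v):
            if id(v) in seen:
                return
            seen.add(id(v))
            for p in v._parents:
                visit(p)
            order.append(v)

        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            for p, fn in zip(v._parents, v._grad_fns):
                p.grad += fn(v.grad)


def stop_gradient(v):
    # Same forward value, but no parents: the backward pass ends here.
    return Value(v.data)


def encoder_grad(use_stop_gradient):
    # Two "views" of an input, a shared encoder weight w, a predictor weight h.
    x1, x2 = Value(1.0), Value(2.0)
    w, h = Value(0.5), Value(3.0)
    z1, z2 = w * x1, w * x2          # encoder outputs for both views
    p1 = h * z1                      # predictor applied to the first view
    target = stop_gradient(z2) if use_stop_gradient else z2
    diff = p1 - target
    loss = diff * diff               # squared error (assumed stand-in loss)
    loss.backward()
    return loss.data, w.grad, h.grad


print(encoder_grad(True))
print(encoder_grad(False))
```

In both cases the forward loss is identical; only the gradient reaching `w` changes, because the target branch contributes nothing to the backward pass when it is detached.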
