Better HN
0 points
lucidrains
2y ago
0 comments
could even try it with a fraction of the attention heads, instead of introducing new tokens
2 comments · 1 top-level
sdenton4
2y ago
· 1 in thread
An important piece here is that there's still a training signal making it to the weights. See SimSiam for a similar example.
lucidrains
OP
2y ago
indeed, simsiam is a great example of the effectiveness of using stop gradient
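A rough toy sketch of the point above, not SimSiam itself: one shared scalar weight, a squared-error loss instead of SimSiam's negative cosine similarity, and no predictor head. All names here (`encoder`, `w_frozen`, and so on) are made up for illustration. Stop-gradient is imitated by passing the target branch a frozen copy of the weight and differentiating numerically.

```python
def encoder(w, x):
    # Toy one-parameter "encoder"; the weight w is shared by both branches.
    return w * x


def loss(w, w_frozen, x1, x2):
    # Prediction branch: uses the live weight w, so gradients flow here.
    p = encoder(w, x1)
    # Target branch: uses w_frozen, a copy treated as a constant.
    # Treating the target as a constant is what stop-gradient amounts to.
    z = encoder(w_frozen, x2)
    return (p - z) ** 2


def grad_with_stop_gradient(w, x1, x2, eps=1e-6):
    # Central finite difference in w while the target's copy stays fixed at w.
    hi = loss(w + eps, w, x1, x2)
    lo = loss(w - eps, w, x1, x2)
    return (hi - lo) / (2 * eps)


def grad_without_stop_gradient(w, x1, x2, eps=1e-6):
    # Same difference, but the target branch moves together with w.
    hi = loss(w + eps, w + eps, x1, x2)
    lo = loss(w - eps, w - eps, x1, x2)
    return (hi - lo) / (2 * eps)


w, x1, x2 = 0.5, 2.0, 3.0
g_sg = grad_with_stop_gradient(w, x1, x2)
g_full = grad_without_stop_gradient(w, x1, x2)
print(g_sg, g_full)
```

With the target frozen the gradient on the shared weight is still nonzero, it just differs from the gradient of the unblocked loss. So the weight keeps receiving a training signal through the prediction branch.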