DeepSeek's Multi-Head Latent Attention
(liorsinai.github.io)
4 points
the_origami_fox
1y ago
1 comment
fspeech
1y ago
Matrix absorption is unnecessary. What is needed is for the order of multiplication to associate towards the direction of the absorption. This and the modified RoPE are what make the caching work.
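A minimal sketch of the associativity point, assuming NumPy and illustrative (not DeepSeek's actual) shapes, and ignoring the decoupled RoPE part: attention scores can be computed directly against the cached latents by reassociating the multiplication, without precomputing a merged ("absorbed") matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, d_head = 64, 16, 32   # hypothetical sizes
seq_len = 10

W_UQ = rng.standard_normal((d_head, d_model))    # query up-projection
W_UK = rng.standard_normal((d_head, d_latent))   # key up-projection
h_q  = rng.standard_normal(d_model)              # current token's hidden state
C_kv = rng.standard_normal((seq_len, d_latent))  # cached latents for past tokens

q = W_UQ @ h_q                                   # query for the current token

# Naive order: up-project every cached latent to a full key, then dot with q.
scores_naive = (C_kv @ W_UK.T) @ q               # (seq_len, d_head) @ (d_head,)

# Reassociated order: fold W_UK into the query once per step, then dot with the
# latents. The cache is only ever read as c_kv; no per-token keys are
# materialized and no absorbed matrix (W_UQ^T W_UK) needs to be precomputed.
scores_assoc = C_kv @ (W_UK.T @ q)               # (seq_len, d_latent) @ (d_latent,)

assert np.allclose(scores_naive, scores_assoc)
```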