DeepSeek's Multi-Head Latent Attention
(liorsinai.github.io)
4 points
the_origami_fox
1y ago
1 comment
fspeech
1y ago
Matrix absorption is unnecessary. What is needed is for the order of multiplication to associate towards the direction of the absorption. This and the modified RoPE are what make the caching work.
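A minimal sketch of the associativity point, assuming NumPy and illustrative (not DeepSeek's actual) shapes, and ignoring the decoupled RoPE part: attention scores can be computed directly against the cached latents by reassociating the multiplication, without precomputing a merged ("absorbed") matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, d_head = 64, 16, 32   # hypothetical sizes
seq_len = 10

W_UQ = rng.standard_normal((d_head, d_model))    # query up-projection
W_UK = rng.standard_normal((d_head, d_latent))   # key up-projection
h_q  = rng.standard_normal(d_model)              # current token's hidden state
C_kv = rng.standard_normal((seq_len, d_latent))  # cached latents for past tokens

q = W_UQ @ h_q                                   # query for the current token

# Naive order: up-project every cached latent to a full key, then dot with q.
scores_naive = (C_kv @ W_UK.T) @ q               # (seq_len, d_head) @ (d_head,)

# Reassociated order: fold W_UK into the query once per step, then dot with the
# latents. The cache is only ever read as c_kv; no per-token keys are
# materialized and no absorbed matrix (W_UQ^T W_UK) needs to be precomputed.
scores_assoc = C_kv @ (W_UK.T @ q)               # (seq_len, d_latent) @ (d_latent,)

assert np.allclose(scores_naive, scores_assoc)
```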