undefined | Better HN

0 pointsein0p1y ago0 comments

The coarse estimate of compute in transformers is about as many MACs as there are weights, or twice as many flops (because multiplication and addition are counted as separate operations). So for llama 70b that’s about 70b MACs per token, which is manageable. What’s far less manageable is reading the entire model into RAM N times a second

0 comments

2 comments · 1 top-level

GaggiX1y ago· 1 in thread

This would only be the case if we ignore the multiplication between queries and keys, and the resulting vector being multiple with the values, and also the multiple heads.

ein0pOP1y ago

No, that is always the case. Attention is only about one third the ops and qk is a fraction of that. Outside of truly massive sequence lengths it doesn’t matter a whole lot, even though it’s nominally quadratic. It’s trivial to run the numbers on this - you only need to do it for one layer.

j / k navigate · click thread line to collapse