undefined | Better HN

0 pointsmenaerus1y ago0 comments

> But your claim on that decoding is compute-bound is plainly wrong.

I did not say anything like that? What I said is that FlashAttention and arguably MLA will not make any significant gains in the inference time. And this is true.

Also, FWIW there are certainly model shapes that are compute-bound in the decode phase so saying that decoding is universally inherently bound by memory access is what is plain wrong, if I were to use your dictionary.

0 comments

3 comments · 1 top-level

rfoo1y ago· 2 in thread

Apologize if I got it wrong, but:

> MLA, FlashAttention and similar optimizations will provide the benefits only when memory access time dominates

> Those would be [...] not the decode phase

This does sound like you are saying that memory access time does NOT dominate during the decode phase. But it does.

Reading your quotes, it looks like maybe you are talking about GPU utilization issues? (i.e. not launching enough threads). Due to the parallelization strategy of the original FA it indeed does not even keep the GPU busy if q*bs is too small. But this is not an inherent limitation of FA-style kernels and can be solved and people did solve it. Or you simply batch more. Now you can keep the GPUs busy at 100% waiting for memory access, but memory access time still dominates, hence "memory-access-bound". And here comes MLA.

> FWIW there are certainly model shapes that are compute-bound in the decode phase

Yeah. But so far all I read don't really work ("work" means being at least just slightly worse than alternatives) under same wall-clock time compute budget. Do you have any pointer to a working example, even on smaller 3B-ish models?

menaerusOP1y ago

> This does sound like you are saying that memory access time does NOT dominate during the decode phase. But it does.

Let's take llama3-8B for an example. GFLOPS needed for self-attention per-layer per-token is roughly 0.15 GFLOPS. For simplicity reasons let's assume that we store all our weights in FP8 precision, then our load memory-bandwidth required for the same is 0.05 GB. Store memory-bandwidth is negligible. If we expand this further to a 1k tokens context, this becomes ~180 GFLOPS and ~0.35 GB per-layer per-1k-ctx.

Assuming that our HW is H100, is this compute-bound or memory-bound?

rfoo1y ago

You need to load cached k/v tensor, in addition to weights. It's going to take me some minutes to find out what's wrong in this napkin math. Will edit or reply this comment later.

1 more reply

j / k navigate · click thread line to collapse