
0 points · EnPissant · 1mo ago · 0 comments

Even if you could fit a 500B model's expert weights in very fast system RAM, it would run so slowly as to be useless.


3 comments · 1 top-level

zozbot234 · 1mo ago · 2 in thread

That's really only "useless" if the only thing you care about is a quick real-time response. Contrary to common perception, MoE models do benefit from batching requests together even when run on a single node; you just need at least ~5 parallel requests in flight (and that's for the very sparsest models) to really see the aggregate benefit.

(Intuitively, that's because the question of whether any active weights are shared among requests, and thus whether any memory throughput is reused, is a generalized birthday problem. That's why even a few parallel requests are quite effective, especially since the "random" choice of experts happens anew at every layer, so there are a lot of independent samples.)
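To put rough numbers on the birthday framing, here is a minimal sketch. It assumes every token picks `top_k` of `num_experts` experts uniformly and independently in each layer, which real routers do not do exactly (skewed routing would raise the overlap). The function names and the example figures (256 experts, top-8) are hypothetical and not taken from any particular model.

```python
from math import comb


def prob_any_sharing(num_experts: int, top_k: int, batch: int) -> float:
    """Probability that at least two of `batch` tokens share an expert in one layer.

    Assumes uniform, independent routing (an idealization).
    """
    p_disjoint = 1.0
    for i in range(batch):
        free = num_experts - i * top_k  # experts not yet claimed by earlier tokens
        if free < top_k:
            return 1.0  # pigeonhole: some expert must be shared
        p_disjoint *= comb(free, top_k) / comb(num_experts, top_k)
    return 1.0 - p_disjoint


def expected_distinct_experts(num_experts: int, top_k: int, batch: int) -> float:
    """Expected number of distinct experts whose weights must be read in one layer."""
    p_untouched = (1.0 - top_k / num_experts) ** batch
    return num_experts * (1.0 - p_untouched)


def reuse_fraction(num_experts: int, top_k: int, batch: int) -> float:
    """Share of expert-weight reads saved versus serving each request on its own."""
    distinct = expected_distinct_experts(num_experts, top_k, batch)
    return 1.0 - distinct / (batch * top_k)


if __name__ == "__main__":
    for b in (1, 5, 16, 64):
        print(b, prob_any_sharing(256, 8, b), reuse_fraction(256, 8, b))
```

The two quantities answer different questions: `prob_any_sharing` is the birthday-style chance that some expert is hit twice in a layer, while `reuse_fraction` is how much memory traffic that sharing actually saves. Under these assumed numbers, some sharing is very likely at a batch of 5, yet the saved fraction stays in the single-digit percent range and only grows substantially at larger batches.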

EnPissant (OP) · 1mo ago

This is just wishful thinking.

For prefill, it's really easy to batch MoE and get really good tok/s, even on a single stream.

For decode, you will run into two problems:

1) you need more parallel requests, which means more memory for context

2) 5 parallel requests will not give you very much expert overlap
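Point 1 can be sized with simple arithmetic. The sketch below assumes a conventional per-head key/value cache stored at 2 bytes per element; the layer count, head count, and context length are made-up illustrative values, and architectures with compressed caches would need less.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    batch: int,
    bytes_per_elem: int = 2,
) -> int:
    """Total KV-cache size for `batch` parallel requests of `context_len` tokens each.

    Assumes a plain per-head K and V cache (hence the factor of 2).
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len * batch


if __name__ == "__main__":
    # Hypothetical model shape, not taken from any real model card.
    for b in (1, 5):
        gib = kv_cache_bytes(60, 8, 128, 32768, b) / 1024**3
        print(f"batch={b}: {gib:.1f} GiB")
```

The cost is linear in the number of requests in flight, so whatever memory one stream needs for context, five streams need five times that on top of the expert weights.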

zozbot234 · 1mo ago

You don't need "very much" expert overlap to see aggregate gains at scale; you just need some of it. That's where the "birthday" framing becomes relevant. Memory for context is an issue, but recent models like DeepSeek V4 use very little of it even at relatively large contexts.

