Someone smarter will probably correct me, but I don't think that's how MoE works. With MoE, a small routing network scores each token and picks the best two of the eight experts to process it; the experts themselves are ordinary feed-forward networks. The choice of experts can change with every new token. For example, say you have two experts that are really good at answering physics questions. For part of the generation, those two will be selected. But later on, maybe the context suggests you need two experts better suited to generating French. This is a silly simplification of what I understand to be going on.
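For the curious, here's a minimal sketch of that top-2-of-8 routing in PyTorch. The names (`Expert`, `MoELayer`) and sizes are made up for illustration, and real implementations fuse this heavily, but the routing logic is roughly this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """Each 'expert' is just an ordinary feed-forward block."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):
        return self.net(x)

class MoELayer(nn.Module):
    def __init__(self, dim: int, hidden: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)  # the small gating network
        self.experts = nn.ModuleList(Expert(dim, hidden) for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x):                         # x: (tokens, dim)
        logits = self.router(x)                   # (tokens, n_experts)
        weights, chosen = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)      # blend the two experts' outputs
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e       # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

# e.g.: MoELayer(dim=16, hidden=64)(torch.randn(5, 16)) routes 5 tokens
```

Note the routing happens per token (and per MoE layer, in fact), which is exactly why the expert choice can flip mid-generation.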
One viable strategy might be to offload as many experts as possible to the GPU and evaluate the rest on the CPU. If you collect some statistics on which experts are used most in your use cases and pin those to the GPU, you might get some cheap but notable speedups over other approaches.
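A rough sketch of the "pin the hot experts" idea, assuming you can hook the router to record its choices; the helper names and device strings here are mine, not any framework's API:

```python
from collections import Counter
import torch

usage = Counter()

def record_routing(chosen: torch.Tensor):
    """Call with the router's chosen expert ids, e.g. after each forward pass."""
    usage.update(chosen.flatten().tolist())

def place_experts(experts, gpu_budget: int, gpu: str = "cuda:0"):
    """Keep the gpu_budget most-used experts resident on the GPU, the rest on CPU."""
    hot = {e for e, _ in usage.most_common(gpu_budget)}
    for i, expert in enumerate(experts):
        expert.to(gpu if i in hot else "cpu")
```

At inference time you'd then run each token through whichever device its chosen expert happens to live on, eating the CPU cost only for the cold experts.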
This being said, presumably if you're running a huge farm of GPUs, you could put each expert onto its own slice of GPUs and orchestrate the flow of data between them as needed. I have no idea how you'd actually do this…
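For a single machine with several GPUs, the naive version of that orchestration is just moving activations to whichever device hosts the chosen expert. A hypothetical dispatch helper might look like this (real expert parallelism across nodes uses all-to-all communication and is far more involved; this only illustrates the data movement):

```python
import torch

def dispatch(x, chosen, experts, devices):
    """x: (tokens, dim); chosen: (tokens,) expert ids; devices[e] hosts expert e."""
    out = torch.empty_like(x)
    for e, expert in enumerate(experts):
        mask = chosen == e
        if mask.any():
            # ship the tokens to the expert's GPU, compute, bring the results back
            out[mask] = expert(x[mask].to(devices[e])).to(x.device)
    return out
```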
Yes, that's more or less it - there's no guarantee that the chosen expert will still be used for the next token, so you'll need to have all of them on hand at any given moment.