Ah, good catch. On closer examination, the attention layers (~2B params) are shared across experts, so in theory you'd need 2B for the shared attention layers plus 5B for each of the two active experts in RAM.
That's 2B + 2×5B = 12B total, meaning it should be runnable on the same hardware as 13B models, with some loading time between generations when the active experts swap.
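To make the back-of-the-envelope math above concrete, here's a quick sketch. The figures (2B shared attention params, 5B per expert, 2 experts active per token, and the bytes-per-param values) are assumptions from this thread, not measured numbers:

```python
def active_params_b(shared_b=2.0, per_expert_b=5.0, active_experts=2):
    # Parameters that must be resident in RAM for a forward pass, in billions:
    # the shared attention layers plus each active expert's weights.
    return shared_b + per_expert_b * active_experts

def ram_gb(params_b, bytes_per_param=2.0):
    # fp16/bf16 weights take 2 bytes per parameter; ~0.5 for 4-bit quantization.
    return params_b * bytes_per_param

active = active_params_b()          # 2 + 2*5 = 12.0 billion
print(active)                       # 12.0
print(ram_gb(active))               # 24.0 GB at fp16
print(ram_gb(active, 0.5))          # 6.0 GB at 4-bit
```

So at fp16 the 12B active parameters land in the same ballpark as a dense 13B model (~24 GB), which is where the hardware comparison comes from.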