It means it's 8 7B models in a trench coat, in a sense. It runs about as fast as a 14B model (apparently 2 experts at a time) but takes up as much memory as a 40B model (70% * 8 * 7B). There is some process trained into it that chooses which experts to use based on the question posed. GPT 4 is allegedly based on the same architecture, but at 8*222B.
In a MoE model with experts_per_token = 2 and each expert having 7B params, after picking the experts it should run as fast as the slowest 7B expert, not a comparable 14B model.
> GPT 4 is based on the same architecture, but at 8*222B.
Do we actually know either that it is MoE or that size? IIRC both of those started as outsider guesses that somehow just became accepted knowledge without any actual confirmation.
Does anyone here know roughly how an expert gets chosen? It seems like a very open-ended problem, and I'm not sure how it can be implemented easily.
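For what it's worth, here is a minimal sketch of one common way to do it in MoE models: top-k gating, where a small learned linear "router" scores every expert and only the k best are run. This is an illustration under assumptions, not a confirmed description of this particular model: the names `top_k_route`, `moe_layer`, and `gate_weights` are made up for the example, the experts are toy functions, and in real implementations the choice is usually made per token inside each layer rather than once per question.

```python
import math

def top_k_route(gate_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    # Pick the k experts with the largest router logits.
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over just the selected logits so the weights sum to 1.
    m = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(x, gate_weights, experts, k=2):
    """Run one token vector x through a toy sparse MoE layer (hypothetical)."""
    # Router: one linear map producing a logit per expert for this token.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    out = [0.0] * len(x)
    # Only the selected experts are evaluated; the rest cost nothing here.
    for idx, weight in top_k_route(logits, k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Toy usage: 4 "experts" that just scale their input by 0, 1, 2, 3.
experts = [lambda v, s=s: [s * t for t in v] for s in range(4)]
gate_weights = [[0.0, 0.0], [5.0, 0.0], [1.0, 0.0], [5.0, 0.0]]
result = moe_layer([1.0, 0.0], gate_weights, experts, k=2)
```

The router weights are trained together with the rest of the network, so nobody hand-writes rules for which expert handles what. All experts still have to sit in memory, since any of them may be picked for the next token.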
It's just a rough estimate, given that these things are fairly linear. The original 7B Mistral was 15 GB and the new one is 86 GB, whereas a fully duplicated 8 * 15 GB would suggest a 120 GB size, so 86/120 ≈ 0.72 of the naive size, suggesting about 28% memory savings. This of course doesn't account for overhead from saving to multiple files versus a single file and such, so it's likely to be a bit off.
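The same back-of-the-envelope arithmetic as a quick script, in case anyone wants to plug in other numbers. The file sizes are the ones quoted in this thread, not independently measured, and the variable names are just for illustration.

```python
# Sizes as quoted in the thread (assumed, not re-measured).
single_gb = 15   # original 7B model
moe_gb = 86      # new 8-expert model
n_experts = 8

# Size if all 8 experts were complete, independent copies of the 7B model.
naive_gb = n_experts * single_gb

# Fraction of the naive size actually used, and the implied saving.
ratio = moe_gb / naive_gb
savings = 1 - ratio

print(f"naive: {naive_gb} GB, ratio: {ratio:.2f}, savings: {savings:.0%}")
```

A ratio below 1 is consistent with some parameters being shared across experts rather than duplicated, but the script only restates the file-size estimate and says nothing about which parts are shared.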