MoE models don’t, in practice, selectively load experts on activation (and if a runtime were designed to do that, it would make them perform worse: the experts activated can differ from token to token, so you’d churn constantly, swapping portions of the model into and out of VRAM). But they do less computation per token for their size than monolithic models, so you can often get tolerable performance on CPU, or split between GPU and CPU at a ratio that would work poorly with a similarly sized monolithic model.
But, still, it’s going to need 262GB for weights (plus a variable amount for context) without quantization, and 66GB+ at 4-bit quantization.
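For reference, the arithmetic behind those figures is just parameters times bits per weight. A quick sketch (the ~131B parameter count is an assumption inferred from the 262GB fp16 figure, not something stated here; the KV cache for context is extra on top of this):

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

# Assumed parameter count, back-derived from 262 GB at 16 bits/weight.
n_params = 131e9

print(weight_memory_gb(n_params, 16))  # fp16/bf16: 262 GB
print(weight_memory_gb(n_params, 4))   # 4-bit quantization: ~65.5 GB
```

Note that real 4-bit quantization formats (e.g. GGUF's Q4 variants) carry some per-block overhead for scales, which is why practical totals land a bit above the raw 4-bit number.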