Also, I don't know of a general solution to streaming models from disk. Is there an inference engine that has this built-in in a way that is generally applicable for any model? I know (I mean, I've seen people say it, I haven't tried it) you can use swap memory with CPU offloading in llama.cpp, and I can imagine that would probably work...but definitely slowly. I don't know if it automatically handles putting the most important routing layers on the GPU before offloading other stuff to system RAM/swap, though. I know system RAM would, over time, come to hold the hottest selection of layers most of the time as that's how swap works. Some people seem to be manually splitting up the layers and distributing them across GPU and system RAM.
Have you actually done this? On what hardware? With what inference engine?