Personally, I'd put all the parameters in NOR flash, then cycle through the row lines sequentially to load the parameters into the MAC. You could load all the inputs in parallel, as fast as the chip's dynamic power limits allow. If you use either DMA or a hardware ring buffer to push all the tokens through the layers, you could keep the throughput up across models of various sizes.
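Here's a rough software sketch of that scheduling idea, under one possible reading of it: each weight row is fetched once, then applied to every waiting stream before moving on to the next row. The names (`run_rows`, `weights`, `streams`) are made up for illustration, a Python list stands in for the NOR flash rows, and a `deque` stands in for the hardware ring buffer. It says nothing about real timing or power.

```python
from collections import deque

def run_rows(weights, streams):
    """Toy model of one MAC shared across many streams.

    weights: list of rows (lists of numbers), standing in for flash row lines.
    streams: deque of (stream_id, input_vector) pairs, standing in for the
             ring buffer that feeds inputs to the MAC.
    Returns one list of dot products per stream, one entry per weight row.
    """
    outputs = [[] for _ in range(len(streams))]
    for row in weights:                      # cycle through row lines in order
        for _ in range(len(streams)):        # one full lap of the ring buffer
            stream_id, x = streams[0]
            acc = 0.0
            for w, xi in zip(row, x):        # multiply-accumulate
                acc += w * xi
            outputs[stream_id].append(acc)
            streams.rotate(-1)               # advance to the next stream
    return outputs

# Two weight rows, two independent streams.
demo = run_rows(
    [[1, 2], [3, 4]],
    deque([(0, [1, 1]), (1, [2, 0])]),
)
```

The point of the loop order is that the expensive step (loading a row) happens once per row, while the cheap step (feeding another stream's input) happens many times.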
Obviously, with only one MAC you couldn't run a single stream at a GHz, but you could run 4,000 separate streams at 250,000 tokens/second each.
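The arithmetic behind those numbers, as a quick check. This assumes the MAC's 1 GHz aggregate rate is split evenly across streams and that each stream gets one token per turn, which is what the figures above imply; `per_stream_rate` is just an illustrative helper.

```python
def per_stream_rate(aggregate_rate_hz, n_streams):
    """Tokens/second each stream sees when one unit is time-shared evenly."""
    return aggregate_rate_hz / n_streams

rate = per_stream_rate(1e9, 4000)   # 1 GHz shared by 4,000 streams
print(rate)                         # 250000.0
```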