A lot of people are reporting incredible results with the Qwen 3.5 MoE models on Apple hardware right now (streaming experts - see https://simonwillison.net/2026/Mar/24/streaming-experts/) - it would be great to get some of those models into that table.
Maybe the 1T parameter Kimi K2.5 too if you can get that to work, see https://twitter.com/seikixtc/status/2036246162936910322 and https://twitter.com/danpacary/status/2036480556045836603
> An error occurred in the application and your page could not be served. If you are the application owner, check your logs for details. You can do this from the Heroku CLI with the command
I get this error when I go to simonwillison.net
Individual posts still load, though, for example: https://simonwillison.net/2026/Mar/19/openai-acquiring-astra...
(I checked your website because I wanted to see if you had written something about trivy/litellm as well. I highly recommend looking into what has happened in the litellm space if possible, as I would love to read your thoughts on it.)
Have a nice day, Simon!
Edit: the website works now, but I am not sure what went wrong earlier (an issue on Heroku's side, maybe?).
Edit 2: now that the website is working, I can see that you have already posted about it.
for a 1T model you'd need to stream something like 2TB of weights per forward pass at fp16. even at peak sequential read speed that's 300+ seconds per token, which is... not great for interactive use but maybe fine for batch inference where you don't care about latency.
still a cool proof of concept though. the gap between 'can run' and 'runs usefully' is where things get interesting.
Isn't this missing the point of MoE models completely? MoE inference is sparse, you only read a small fraction of the weights per layer. You still have a problem of each individual expert-layer being quite small (a few MiBs each give or take) but those reads are large enough for the NVMe.
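To put rough numbers on the dense-versus-sparse point, here is a back-of-envelope sketch. The 6.5 GB/s read rate, the 32B active-parameter count and the 4-bit case are illustrative assumptions, not measurements, and it assumes the worst case where none of the active weights are already cached in RAM.

```python
# Back-of-envelope: time to stream the weights one token touches from SSD.
# Every constant below is an illustrative assumption, not a measurement.

def seconds_per_token(params_read: float, bytes_per_param: float,
                      read_gb_per_s: float) -> float:
    """Seconds needed to read `params_read` weights at the given read rate."""
    bytes_read = params_read * bytes_per_param
    return bytes_read / (read_gb_per_s * 1e9)

TOTAL_PARAMS = 1e12    # 1T total parameters (figure from the thread)
ACTIVE_PARAMS = 32e9   # assumed parameters actually used per token in a sparse MoE
READ_GB_PER_S = 6.5    # assumed sustained NVMe sequential read rate

# Reading every weight per token (what the 2TB estimate implies).
dense_fp16 = seconds_per_token(TOTAL_PARAMS, 2.0, READ_GB_PER_S)
# Reading only the active weights, at fp16 and at an assumed 4-bit quantization.
moe_fp16 = seconds_per_token(ACTIVE_PARAMS, 2.0, READ_GB_PER_S)
moe_q4 = seconds_per_token(ACTIVE_PARAMS, 0.5, READ_GB_PER_S)

print(f"all weights, fp16:    {dense_fp16:.1f} s/token")
print(f"active only, fp16:    {moe_fp16:.1f} s/token")
print(f"active only, 4-bit:   {moe_q4:.1f} s/token")
```

Under these assumptions the "read everything" case lands at roughly 300 s/token, in line with the estimate above, while reading only the active weights is about 31x less data per token. Any experts already resident in RAM would reduce it further.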
Joke aside (I do have them tho!), I don't think Optane is that much use (not to mention it is only 256GiB for my unit). It is a useful crutch if you have legacy software that is not designed to issue multiple reads / writes in parallel. If your software does issue them in parallel, Optane is really not faster than NVMe, especially the modern drives.
Still, couldn't one get a RAID 0 card with four drives to saturate a 16x lane? That's already the max one could push through PCIe anyhow.
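A quick sanity check on the RAID 0 idea, as a sketch only. It assumes PCIe 4.0 (16 GT/s per lane, 128b/130b encoding), a hypothetical 7 GB/s per drive, and perfect striping, and it ignores packet and protocol overhead, so real throughput would be lower.

```python
# Raw PCIe link rate versus the aggregate of striped drives.
# Assumptions: PCIe 4.0 signalling, ideal RAID 0 scaling, no protocol overhead.

GT_PER_S_PER_LANE = 16.0   # PCIe 4.0 transfer rate per lane
ENCODING = 128 / 130       # 128b/130b line encoding efficiency

def link_gb_per_s(lanes: int) -> float:
    """Raw one-direction link rate in GB/s for the given lane count."""
    return GT_PER_S_PER_LANE * ENCODING * lanes / 8  # bits -> bytes

def raid0_gb_per_s(drives: int, per_drive_gb_per_s: float, host_lanes: int) -> float:
    """Ideal striped read rate, capped by the host link."""
    return min(drives * per_drive_gb_per_s, link_gb_per_s(host_lanes))

print(f"x16 link:        {link_gb_per_s(16):.1f} GB/s")
print(f"4 x 7 GB/s RAID: {raid0_gb_per_s(4, 7.0, 16):.1f} GB/s")
```

With these figures four drives add up to 28 GB/s, slightly under the raw x16 rate of about 31.5 GB/s, so the drives and the link are in the same ballpark.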
I do wonder in practice how the 'smarts' pan out, because putting a ton of stress on your NVMe during generation is probably not the best choice for its longevity.
"overloading NVMe"? What is that about? First time I've heard anything about it.
> because putting a ton of stress on your NVMe during generation
Really shouldn't "stress your NVMe", something is severely wrong if that's happening. I've been hammering my SSDs forever, and while write operations "hurt" the longevity of the flash cells themselves, the controller interface really shouldn't be affected by this at all, unless I'm missing something here.
For what Hypura does, the Max is the sweet spot. 64GB loads a 70B at Q4 with room to spare, and double the bandwidth of the Pro means generation is actually usable instead of just technically possible.
macOS doesn't have an "OOM killer" in that sense. (It has an out of swap space killer but it's pretty weak.)
So what will happen is, either your memory wiring will fail, or else it will get really slow and panic.
Come on, "Run" is not the right word. "Crawl" is.
Headlines like that are misleading.
You do not explain how any kind of predictor can work for MoE experts.
You do not explain how prediction can even be useful. I can predict the layers used in a dense model (all of them are used in order), but that doesn't help me much. It's still bottlenecked on bandwidth (hint: MoE doesn't change this).
What makes this approach faster is that the model's access pattern is completely deterministic during
inference. You know exactly which tensors are needed next because transformer layers execute sequentially. So
you can issue large sequential reads and prefetch the next layer while the current one is computing on Metal.
The OS page cache can't do that — it has no concept of "layer N+1 comes after layer N."
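As a rough illustration of the "layer N+1 comes after layer N" idea, the sketch below memory-maps a weights file and hints the kernel about the next layer before working on the current one. This is not the project's actual mechanism (the comment describes issuing its own large reads). It uses Python's `mmap.madvise` with `MADV_WILLNEED`, which only exists on platforms that support it, and `run_layers`, `prefetch` and the toy file layout are hypothetical names made up for this example.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE

def prefetch(mm: mmap.mmap, offset: int, length: int) -> bool:
    """Hint that [offset, offset + length) will be needed soon.

    Returns False when this platform's Python lacks madvise/MADV_WILLNEED.
    """
    advice = getattr(mmap, "MADV_WILLNEED", None)
    if advice is None or not hasattr(mm, "madvise"):
        return False
    start = offset - (offset % PAGE)  # madvise needs a page-aligned start
    mm.madvise(advice, start, length + (offset - start))
    return True

def run_layers(mm: mmap.mmap, layer_size: int, n_layers: int, compute):
    """Process layers in order, hinting layer i+1 before computing layer i."""
    outputs = []
    for i in range(n_layers):
        if i + 1 < n_layers:
            prefetch(mm, (i + 1) * layer_size, layer_size)
        weights = mm[i * layer_size:(i + 1) * layer_size]
        outputs.append(compute(weights))
    return outputs

def demo():
    """Toy file: layer i is 5000 bytes, all equal to i."""
    layer_size, n_layers = 5000, 4
    with tempfile.NamedTemporaryFile(delete=False) as f:
        for i in range(n_layers):
            f.write(bytes([i]) * layer_size)
        path = f.name
    try:
        with open(path, "rb") as f, \
                mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Stand-in "compute": average byte value of the layer.
            return run_layers(mm, layer_size, n_layers,
                              lambda w: sum(w) // len(w))
    finally:
        os.remove(path)

print(demo())
```

The hint is advisory: whether the read actually overlaps with compute is up to the kernel, which is one reason a runtime might prefer to issue explicit reads instead.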
For MoE it's even more stark. The OS would page in all 8 experts on the first token that routes to each one,
then evict them under memory pressure with LRU, which has no idea that expert 3 fires 10x more often than
expert 7. The neuron cache here is basically a domain-specific replacement policy.
man 2 madvise
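The LRU-versus-frequency argument can be seen in a small simulation. This is a toy sketch, not the project's cache: the count-based policy is a plain "evict the least-used resident expert" rule, and the trace is deliberately contrived so that a hot expert (3) is interleaved with a rotating scan over the seven cold ones.

```python
from collections import Counter, OrderedDict

def lru_hits(trace, capacity):
    """Cache hits under least-recently-used eviction."""
    cache = OrderedDict()
    hits = 0
    for expert in trace:
        if expert in cache:
            hits += 1
            cache.move_to_end(expert)        # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)    # evict least recently used
            cache[expert] = True
    return hits

def lfu_hits(trace, capacity):
    """Cache hits when evicting the resident expert with the lowest use count."""
    cache = set()
    counts = Counter()
    hits = 0
    for expert in trace:
        counts[expert] += 1
        if expert in cache:
            hits += 1
        else:
            if len(cache) >= capacity:
                # Ties are broken by expert id so the result is deterministic.
                victim = min(cache, key=lambda e: (counts[e], e))
                cache.remove(victim)
            cache.add(expert)
    return hits

def toy_trace(rounds):
    """Each round: hot expert 3, then the next two cold experts in rotation."""
    cold = [0, 1, 2, 4, 5, 6, 7]
    trace = []
    for r in range(rounds):
        trace += [3, cold[(2 * r) % 7], cold[(2 * r + 1) % 7]]
    return trace

trace = toy_trace(70)
print("LRU hits:", lru_hits(trace, capacity=2))
print("LFU hits:", lfu_hits(trace, capacity=2))
```

On this contrived trace with room for two experts, LRU never gets a hit because the two cold experts always push expert 3 out before it returns, while the count-based policy keeps expert 3 resident and hits on every access to it after the first. Real routing traces are less adversarial, so the gap would be smaller.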