Hypura – A storage-tier-aware LLM inference scheduler for Apple Silicon

(github.com)

221 points · tatef · 3mo ago · 85 comments

59 comments · 21 top-level

baq · 3mo ago · 8 in thread

Intel Optane rolling in its grave.

aitchnyu · 3mo ago

Memristors are also missing from this AI hype, even though they were supposedly around the corner 10 years ago.

moffkalast · 3mo ago

Wouldn't be Intel if they didn't quit halfway through on a good thing.

Still, couldn't one get a RAID 0 card with four drives to saturate an x16 link? That's already the most one could push through PCIe anyhow.

liuliu · 3mo ago

I still have 4 brand-new ones in my storage unit, just in case of moments like these.

Joking aside (I do have them, though!), I don't think Optane is of much use (not to mention that my unit is only 256GiB). It is a useful crutch if you have legacy software that is not designed to issue multiple reads/writes in parallel. If your software does issue them in parallel, Optane is really not faster than NVMe, especially modern drives.

zozbot234 · 3mo ago

It's not about being faster (except for small reads where latency dominates, which is actually relevant when reading a handful of expert-layers immediately after routing); it's the wear resistance, which opens up the possibility of storing the KV-cache (including the "linear" KV-cache of recent Qwen, which is not append-only as it was with the pure attention model) and maybe even per-layer activations, though the latter are the least useful to store given how ephemeral they are.

speedgoose · 3mo ago

Is it too late for Intel to bring them back to life?

c0balt · 3mo ago

Yes, their NAND division has been sold; it is now mostly under Solidigm. Maybe Solidigm could bring it back, but that seems unlikely (given the previous commercial failure).

walterbell · 3mo ago

Nvidia and SK Hynix are bringing HBF to market for $$.

0ptan3 · 3mo ago

pmem

Insanity · 3mo ago · 5 in thread

This is a pretty cool project! Essentially this is like using swap to extend your RAM, but in a 'smart' way so you don't overload the NVMe unnecessarily.

I do wonder how the 'smarts' pan out in practice, because putting a ton of stress on your NVMe during generation is probably not the best choice for its longevity.

zozbot234 · 3mo ago

This is not putting any stress or wear on the NVMe, it's a pure read workload.

tatef · OP · 3mo ago

Yes, exactly this.

embedding-shape · 3mo ago

> but in a 'smart' way so you don't overload the NVMe unnecessarily

"overloading NVMe"? What is that about? First time I've heard anything about it.

> because putting a ton of stress on your NVMe during generation

Really shouldn't "stress your NVMe", something is severely wrong if that's happening. I've been hammering my SSDs forever, and while write operations "hurt" the longevity of the flash cells themselves, the controller interface really shouldn't be affected by this at all, unless I'm missing something here.

tatef · OP · 3mo ago

Hypura reads tensor weights from the GGUF file on NVMe into RAM/GPU memory pools, then compute happens entirely in RAM/GPU.

There is no writing to the SSD during inference with this architecture.

Insanity · 3mo ago

I had assumed the controller would generate a lot of heat if it's reading continuously. But maybe that's not actually bad.

simonw · 3mo ago · 4 in thread

Suggestion for the maintainers: the comparison table currently lists some pretty old models, Qwen 2.5 14B and Mixtral 8x7B and Llama 3.3 70B.

A lot of people are reporting incredible results with the Qwen 3.5 MoE models on Apple hardware right now (streaming experts - see https://simonwillison.net/2026/Mar/24/streaming-experts/) - it would be great to get some of those models into that table.

Maybe the 1T parameter Kimi K2.5 too if you can get that to work, see https://twitter.com/seikixtc/status/2036246162936910322 and https://twitter.com/danpacary/status/2036480556045836603

Imustaskforhelp · 3mo ago

Simon, a little off-topic, but it seems that your website isn't working.

> An error occurred in the application and your page could not be served. If you are the application owner, check your logs for details. You can do this from the Heroku CLI with the command

I get this error when I go to simonwillison.net

Any random blog/link works for example though: https://simonwillison.net/2026/Mar/19/openai-acquiring-astra...

(I checked your website because I wanted to see if you had written something about trivy/litellm as well. I highly recommend checking out what has happened in the litellm space if possible, as I would love to read your thoughts on it.)

Have a nice day, Simon!

Edit: the website works now, but I am not sure what went wrong previously (an issue with Heroku, maybe?).

Edit 2: now that the website is working, I can see that you have already made a post about it.

tatef · OP · 3mo ago

Thanks for sharing this! If you'd be interested in running the benchmark yourself with Hypura I'd happily merge into our stats. Otherwise will add to my todo list :)

abtinf · 3mo ago

The lack of a token rate metric for the Kimi example is disappointing.

zozbot234 · 3mo ago

The latter link says they get ~1.7 tok/s which is quite impressive for a near-SOTA local model running on ordinary hardware.

vanyaland · 3mo ago · 3 in thread

For a lot of local workloads, sub-1 tok/s is useless in foreground and perfectly acceptable in background. If the choice is “this crashes” vs “this finishes overnight,” that’s still a meaningful capability jump.

joelthelion · 3mo ago

How much are you going to spend on electricity though? Is this really going to be more cost-effective than just using OpenRouter?

austinthetaco · 3mo ago

There are many other reasons someone might want to run a model locally beyond cost savings: ownership of the data flow and use in locations without internet access, to name a couple.

hadlock · 3mo ago

If my options are run Opus 4.6 in the cloud for $200/mo or run Opus 4.6 locally for $275, I am absolutely going to self-host 100% of the time. Sending all that data to the cloud presents tremendous legal risk for companies. There are currently no retention rules for privately hosted AI.

root_axis · 3mo ago · 3 in thread

Are there any 1T parameter open source models?

zozbot234 · 3mo ago

Kimi K2.5?

ai-inquisitor · 3mo ago

That model is "open weight", not open source. We have no idea what data Moonshot trained on.

root_axis · 3mo ago

Thanks, TIL.

amelius · 3mo ago · 3 in thread

This is <1 tok/s for the 40GB model.

Come on, "Run" is not the right word. "Crawl" is.

Headlines like that are misleading.

feznyng · 3mo ago

Could still be useful; maybe for overnight async workloads? Tell your agent to research xyz at night and wake up to a report.

maleldil · 3mo ago

Assuming 1 token per second and "overnight" being 12 hours, that's 43,200 tokens. I'm not sure what you can meaningfully achieve with that.
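
The arithmetic above can be checked with a few lines. This is only a sketch: the function name is made up for illustration, and the rate and duration are the ones assumed in the comment, not measured figures.

```python
def tokens_generated(tokens_per_second: float, hours: float) -> int:
    """Total tokens produced at a steady decode rate over a time window."""
    return int(tokens_per_second * hours * 3600)


# 1 tok/s for a 12-hour "overnight" run, as assumed in the comment.
overnight_budget = tokens_generated(1.0, 12)
print(overnight_budget)
```

The budget covers everything the model emits, so any reasoning or tool-call output counts against the same total.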

smlacy · 3mo ago

Yes, and with virtually zero context, which makes an enormous difference for TTFT on the MoE models.

monksy · 3mo ago · 3 in thread

There needs to be something like this from Ollama. At the moment Ollama has a lot of flaws that prevent it from getting great performance. (My understanding is that it needs better GPU/CPU splits, etc.) But Ollama is the only way to host an LLM and have it switch out on demand. Sigh.

zozbot234 · 3mo ago

Ollama has very substandard support for mmap at present, which hurts inference with larger models. There are some recent pull requests in flight that should help address this to at least some extent https://github.com/ollama/ollama/pull/14525 https://github.com/ollama/ollama/pull/14134 https://github.com/ollama/ollama/pull/14864 but progress seems to be stalling. Their support for recent Qwen models seems to also have some bespoke incompatibilities with llama.cpp, which doesn't help matters; it's difficult to test the same model with both.

rubiquity · 3mo ago

llama.cpp and llama-swap do this better than Ollama and with far more control.

circularfoyers · 3mo ago

Don't even need to use llama-swap anymore now that llama-server supports the same functionality.

anshulbasia27 · 3mo ago · 3 in thread

OS paging would be significantly worse here. The kernel's page fault handler is reactive — it doesn't know you're about to read layer 47's FFN weights, so it can't prefetch. You stall on every fault, wait for the 4KB/16KB page to load, then resume. With 80 layers of dense FFN streaming, that's thousands of cold faults per token.

What makes this approach faster is that the model's access pattern is completely deterministic during inference. You know exactly which tensors are needed next because transformer layers execute sequentially. So you can issue large sequential reads and prefetch the next layer while the current one is computing on Metal. The OS page cache can't do that; it has no concept of "layer N+1 comes after layer N."
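
A minimal sketch of the overlap described in this paragraph, assuming a single background I/O thread. This is not Hypura's code: `load_layer` and `compute_layer` are hypothetical callables standing in for the storage read and the Metal compute step.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, TypeVar

W = TypeVar("W")  # weights for one layer
X = TypeVar("X")  # activations passed between layers


def run_layers_with_prefetch(
    num_layers: int,
    load_layer: Callable[[int], W],      # hypothetical: reads layer i's weights
    compute_layer: Callable[[W, X], X],  # hypothetical: applies one layer
    x: X,
) -> X:
    """Run layers in order, loading layer i+1 while layer i computes."""
    if num_layers == 0:
        return x
    with ThreadPoolExecutor(max_workers=1) as io_pool:
        pending = io_pool.submit(load_layer, 0)
        for i in range(num_layers):
            # Blocks only when the read is slower than the previous compute.
            weights = pending.result()
            if i + 1 < num_layers:
                # Start the next read before computing on the current layer.
                pending = io_pool.submit(load_layer, i + 1)
            x = compute_layer(weights, x)
    return x
```

With perfect overlap the per-layer time becomes the larger of the read time and the compute time rather than their sum, so the gain is small whenever one of the two dominates.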

For MoE it's even more stark. The OS would page in all 8 experts on the first token that routes to each one, then evict them under memory pressure with LRU, which has no idea that expert 3 fires 10x more often than expert 7. The neuron cache here is basically a domain-specific replacement policy.
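
To make the contrast with LRU concrete, here is a sketch of a frequency-aware replacement policy. It is an illustration only and not Hypura's neuron cache, whose actual policy is not described in this thread; the class name and the `load_expert` callable are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, Generic, TypeVar

T = TypeVar("T")  # whatever represents one expert's weights


class FrequencyAwareExpertCache(Generic[T]):
    """Keep frequently routed experts resident; evict the least-used one."""

    def __init__(self, capacity: int, load_expert: Callable[[int], T]) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.load_expert = load_expert  # hypothetical loader, e.g. a storage read
        self.resident: Dict[int, T] = {}
        self.hits: Counter = Counter()  # lifetime routing count per expert id

    def get(self, expert_id: int) -> T:
        self.hits[expert_id] += 1
        if expert_id in self.resident:
            return self.resident[expert_id]
        if len(self.resident) >= self.capacity:
            # Evict the resident expert with the fewest lifetime routings.
            victim = min(self.resident, key=lambda e: self.hits[e])
            del self.resident[victim]
        self.resident[expert_id] = self.load_expert(expert_id)
        return self.resident[expert_id]
```

With a capacity of 2, routing to expert 3 ten times, then expert 7 once, then expert 5 evicts expert 7 and keeps expert 3. A plain LRU would evict expert 3 in that sequence, because expert 7 was touched more recently.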

zozbot234 · 3mo ago

> The kernel's page fault handler is reactive — it doesn't know you're about to read layer 47's FFN weights, so it can't prefetch.

man 2 madvise
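
For readers who have not used it, a sketch of what the man page describes, written against Python's `mmap` wrapper (`mmap.madvise` needs Python 3.8+ and a platform that provides `madvise`). The helper name and the temporary file standing in for a weights file are assumptions for illustration.

```python
import mmap
import os
import tempfile


def advise_willneed(mm: mmap.mmap, offset: int, length: int) -> None:
    """Hint that [offset, offset + length) will be read soon."""
    page = mmap.PAGESIZE
    start = (offset // page) * page  # madvise needs a page-aligned start
    end = min(offset + length, len(mm))
    if hasattr(mmap, "MADV_WILLNEED") and start < end:
        mm.madvise(mmap.MADV_WILLNEED, start, end - start)


# Demo on a temporary file standing in for a weights file.
with tempfile.NamedTemporaryFile() as f:
    f.write(os.urandom(4 * mmap.PAGESIZE))
    f.flush()
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Advise the region for the "next tensor", then read it.
        advise_willneed(mm, offset=mmap.PAGESIZE + 100, length=mmap.PAGESIZE)
        chunk = mm[mmap.PAGESIZE + 100 : 2 * mmap.PAGESIZE + 100]
```

The call is only a hint: the kernel may start reading ahead, but nothing guarantees the pages are resident by the time they are touched.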

astrange · 3mo ago

That works for readahead but it's not good for random access. readv, aio, dispatch_io are better there.
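
A sketch of the explicit-read alternative for scattered ranges. It swaps in a thread pool of `os.pread` calls rather than using `readv`, aio or `dispatch_io` directly, so it only illustrates the idea of issuing many positional reads at once; the function name is hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def read_ranges(
    path: str, ranges: List[Tuple[int, int]], workers: int = 8
) -> List[bytes]:
    """Read (offset, length) ranges concurrently; results keep input order."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # os.pread takes an explicit offset, so threads can share one fd
            # without racing on a file position. It may return fewer bytes
            # than requested near end of file.
            return list(pool.map(lambda r: os.pread(fd, r[1], r[0]), ranges))
    finally:
        os.close(fd)
```

Keeping several reads in flight lets the drive service them in parallel, which is where NVMe random-read throughput comes from.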

EnPissant · 3mo ago

That assumes you have significant work to do between fetches (so you can prefetch while using the current data). With LLM decode you don't.

marksully · 3mo ago · 2 in thread

Where does "1T parameter model" come from? I can only see models with 70B params or less mentioned in the repo.

tatef · OP · 3mo ago

I'm referencing it as being possible; however, I didn't share benchmarks because candidly the performance would be so slow it would only be useful for very specific tasks over long time horizons. The more practical use cases are less flashy but capable of achieving multiple tokens/sec (i.e., smaller MoE models where not all experts need to be loaded in memory simultaneously).

causal · 3mo ago

Yeah, the title comes from nowhere in the link. No doubt it's possible, but all that matters is speed, and we learn nothing about that here...

zozbot234 · 3mo ago · 2 in thread

It will be interesting to compare this to https://news.ycombinator.com/item?id=47476422 and https://news.ycombinator.com/item?id=47490070 . Very similar design except that this is apparently using mmap, which according to the earlier experiment incurs significant overhead.

salynchnew · 3mo ago

It was written by an LLM, so... yeah.

jeffybefffy519 · 3mo ago

Except this isn't using heavily quantised versions of the model, which reduce quality.

dev_tools_lab · 3mo ago · 2 in thread

Nice work on the scheduler. Have you benchmarked parallel inference across multiple models? Running GPT, Claude and Gemini simultaneously on the same input is where latency becomes a real constraint.

zozbot234 · 3mo ago

GPT-OSS exists but Claude and Gemini aren't available locally, lol.

dev_tools_lab · 3mo ago

True, Claude and Gemini aren’t local yet — I mostly meant running all available local models in parallel.

Even with just open-source LLMs, you can see interesting differences in flagged issues when cross-validating outputs.

shubhamintech · 3mo ago

The MoE point matters here, i.e. sparse activation means you're not reading all 2TB per forward pass, but the access pattern flips from sequential to random, which is exactly the worst case for NVMe. Been thinking about this a lot for agent inference workloads where you want consistent latency more than peak throughput.

msbhogavi · 3mo ago

"As much memory as possible" is right for model capacity but misses bandwidth. Apple Silicon has distinct tiers: M4 Pro at 273 GB/s, M4 Max at 546 GB/s, M3 Ultra at 819 GB/s. Bandwidth determines tok/s once the model fits in memory. An M4 Max gives you roughly 2x the decode speed of an M4 Pro on the same model.

For what Hypura does, the Max is the sweet spot. 64GB loads a 70B at Q4 with room to spare, and double the bandwidth of the Pro means generation is actually usable instead of just technically possible.
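
The bandwidth argument can be sketched as a back-of-the-envelope bound, assuming decode is purely memory-bound and every token streams the full set of weights once. The 40 GB figure for a 70B model at Q4 is an assumption for illustration, and the function name is hypothetical; the bandwidth numbers are the ones quoted in the comment.

```python
def decode_tokens_per_second(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode rate when each token reads all weights once."""
    return bandwidth_gb_s / weights_gb


MODEL_GB = 40.0  # assumption: roughly a 70B model at about 4.5 bits per weight

for chip, bandwidth in [("M4 Pro", 273.0), ("M4 Max", 546.0)]:
    bound = decode_tokens_per_second(bandwidth, MODEL_GB)
    print(f"{chip}: at most {bound:.1f} tok/s")
```

Real decode rates land below this bound because of KV-cache reads, compute, and scheduling overhead, but the 2x ratio between the two chips follows directly from the 2x bandwidth.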

astrange · 3mo ago

> Consumer hardware (MacBook Pro, Mac Studio) ships with fast unified memory and NVMe storage, but limited capacity. A 32 GB M1 Max cannot naively load a 40 GB model — the OS will swap-thrash until the OOM killer intervenes.

macOS doesn't have an "OOM killer" in that sense. (It has an out of swap space killer but it's pretty weak.)

So what will happen is, either your memory wiring will fail, or else it will get really slow and panic.

dev_tools_lab · 3mo ago

Thanks for this project. Prioritizing MoE models and adding an intelligent NVMe cache could improve efficiency, especially on the M4 Max where bandwidth makes usage more realistic.

EnPissant · 3mo ago

You do not provide any comparison to llama.cpp with mmap.

You do not explain how any kind of predictor can work for MoE experts.

You do not explain how prediction can even be useful. I can predict the layers used in a dense model (all of them are used in order), but that doesn't help me much. It's still bottlenecked on bandwidth (hint: MoE doesn't change this).

dangoodmanUT · 3mo ago

With unified memory and such strong OS-hardware integration, one would hope that swap could handle this task.

nullbyte · 3mo ago

I am curious how the TPS compares vs default OS virtual memory paging

speedgoose · 3mo ago

I wonder how many minutes per token on GLM 5.

solozaki · 3mo ago

hello

erikcw · 3mo ago

Simon Willison wrote a good post about Dan Woods’ work on “Autoresearching Apple's "LLM in a Flash" to run Qwen 397B locally”.

[0] https://simonwillison.net/2026/Mar/18/llm-in-a-flash/
