Even though the project was meant to be educational, it gave me an idea I can't get out of my head: what if we started building ultra-optimized inference engines tailored to an exact GPU+model combination? GPUs are expensive and harder to get by the day. If you remove enough abstractions and code directly against the exact hardware/model pair, you can probably optimize things quite a lot (I hope). Maybe run an agent that tries to optimize inference in a loop (like autoresearch), empirically testing speed/quality.
The only problem with this is that once a model becomes outdated, you have to do it all again from scratch.
The inference engines in use already include different backend building blocks optimized for different hardware.
While there are places where you can pick up some low-hanging fruit for less popular platforms, there isn't a lot of room to squeeze in super-optimized model runners for specific GPU families and get much better performance. The core computations are already done by highly optimized kernels for each GPU.
There are forks of llama.cpp that have better optimizations for running on CPU architectures, but (barring maintainer disagreements) a better use of time is to target merging these improvements upstream instead of trying to make super specific model+GPU runners.
I have an older W7900 (RDNA3) which, besides 48 GB of VRAM, has some pretty decent roofline specs (123 FP16 TFLOPS / INT8 TOPS, 864 GB/s MBW), but has had notoriously bad support both from AMD (ROCm) and from llama.cpp.
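Back-of-envelope on what those roofline numbers imply (my arithmetic, not anything from the spec sheet):

$$I^{*} = \frac{123 \times 10^{12}\ \text{ops/s}}{864 \times 10^{9}\ \text{B/s}} \approx 142\ \text{ops/byte}$$

A batch-1 decode GEMV at INT8 is on the order of 2 ops per weight byte, far below that knee, so single-stream decode on this card is memory-bandwidth bound regardless of how good the compute kernels are.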
Recently I decided I'd like to turn the card into a dedicated agentic/coder endpoint, so I started tuning a W8A8-INT8 model. After a few days of autolooping (about 800 iterations using a variety of frontier/SOTA models; Kimi K2.6 did surprisingly well), I ended up with prefill +20% and decode +50% faster than the best llama.cpp numbers for Qwen3.6 MoE.
I'm currently grinding MTP and DFlash optimization on it, but I've been pretty pleased with the results, and will probably try Gemma 4 next.
My effort is called shady-thinker and is on GitHub at github.com/tmzt/shady-thinker.
This was inspired in part by Antirez's earlier work with C kernels as well as other efforts to support in-browser LLMs. I've adapted them to Rust and the wgpu library.
Gemma 4 is also the next likely target (with the MTP work) as I'm experimenting with local AI agents.
I'd love to see what you've done to improve prefill and decode, even if it's not directly applicable.
One difference: I'm using MLX and GPTQ 4-bit quants (including AutoRound) with safetensors, since my shader pipeline is pretty much fixed for each model; ggml just adds unnecessary complexity.
Even if it's not perfect, if you publish on GH or HF, some other agent can maybe start there instead of from zero. I did this for Ling-2.6-flash (107B-A7B4 MoE), the biggest LLM I can run for practical use on the other hardware I have for local LLMs (an M2 Max). Even if MTP isn't working well, it's still an improvement over current llama.cpp, which doesn't run Ling-2.6-flash at all. See https://huggingface.co/inclusionAI/Ling-2.6-flash/discussion.... The 4-bit quants are at https://huggingface.co/ljupco/Ling-2.6-flash-GGUF, and the branch is at https://github.com/ljubomirj/llama.cpp/tree/LJ-Ling-2.6-flas....
I think llama.cpp could have done a much better job supporting PCs. Sure, some of it is due to bad vendor support, but with so many users I'm surprised we don't see more optimized inference on standard PCs.
### Diagnosing parallelism pathologies (L1)
*Grid occupancy* (a quick checker sketch follows this list):
- Is `Grid_Size / Workgroup_Size >= CU count` (W7900 = 96, Strix Halo = 40)?
- < 0.3 = massively undersubscribed. Fix the grid FIRST; micro-optimization will NOT help.
- 0.3-1.0 = partially utilized; depends on VGPR/LDS pressure.
- 1.0-4.0 = healthy; micro-optimization can help.
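A tiny host-side checker for the heuristic above. The launch numbers are made up; substitute `grid_size`, `workgroup_size`, and `cu_count` from your actual dispatch.

```cuda
#include <cstdio>

int main() {
    // Hypothetical launch parameters; pull the real values from the dispatch.
    const int grid_size      = 4096;  // total threads launched
    const int workgroup_size = 512;   // threads per workgroup
    const int cu_count       = 96;    // 96 = W7900, 40 = Strix Halo

    // Workgroups per compute unit, per the occupancy heuristic above.
    const float ratio = (float)(grid_size / workgroup_size) / cu_count;
    if (ratio < 0.3f)
        printf("%.2f: massively undersubscribed -- fix the grid first\n", ratio);
    else if (ratio < 1.0f)
        printf("%.2f: partially utilized -- check VGPR/LDS pressure\n", ratio);
    else  // the list above only calls out 1.0-4.0 explicitly
        printf("%.2f: healthy -- micro-optimization can help\n", ratio);
    return 0;
}
```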
*Within-block distribution:*
- Does the kernel do useful work across all threads, or is there an `if (threadIdx.x == 0)` gate around a serial top-k, reduction, or scan (sketched below)? For c=1 decode, many kernels can't grow the grid, but they can always parallelize inside the block.
- `Scratch_Size > 0` from dynamically-indexed per-thread arrays is a strong secondary signal of the within-block pathology.
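For concreteness, the gated-serial shape that bullet describes looks like this; an illustrative kernel with placeholder names (`scores`, `n`), not any actual production code:

```cuda
// Pathology: 511 of 512 launched threads idle while thread 0 walks all
// n scores serially behind the gate.
__global__ void router_argmax_serial(const float* scores, int n, int* out_idx) {
    if (threadIdx.x == 0) {            // the gate in question
        int best = 0;
        for (int i = 1; i < n; ++i)    // e.g. 2048 serial compares
            if (scores[i] > scores[best]) best = i;
        *out_idx = best;
    }
}
```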
*Router top-k (within-block fix)* (fix pattern sketched below):
- Kernel: `qwen35_router_select_kernel` @ c=1 decode.
- Before: grid=1 (can't help; num_tokens=1), blockDim=512, `if (threadIdx.x == 0)` gated 2048 serial compares. Scratch=144 B from spilled per-thread arrays.
- Fix: warp-shuffle parallel argmax across the whole block + a `__shared__` top_vals buffer eliminating the spill.
- Result: 5.7× kernel speedup, +6.6% on 4K/D4K E2E.
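A minimal sketch of that fix pattern, reconstructed from the description rather than taken from the actual kernel (shown as a single argmax for brevity; the real fix is a top-k): every thread scans a strided slice, warp shuffles fold lanes without spilling, and a small `__shared__` buffer plays the top_vals role.

```cuda
#include <cmath>

// Assumes blockDim.x = 512 (16 warps of 32 lanes) and a single block (grid=1).
__global__ void router_argmax_parallel(const float* scores, int n, int* out_idx) {
    __shared__ float warp_val[16];  // per-warp winners instead of spilled arrays
    __shared__ int   warp_idx[16];

    // 1. Strided scan: all 512 threads do useful work instead of one.
    float best_val = -INFINITY;
    int   best_idx = 0;
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        if (scores[i] > best_val) { best_val = scores[i]; best_idx = i; }

    // 2. Warp-shuffle reduction: fold 32 lanes down to lane 0.
    for (int off = 16; off > 0; off >>= 1) {
        float v = __shfl_down_sync(0xffffffffu, best_val, off);
        int   j = __shfl_down_sync(0xffffffffu, best_idx, off);
        if (v > best_val) { best_val = v; best_idx = j; }
    }

    // 3. Lane 0 of each warp publishes its winner to shared memory.
    if (threadIdx.x % 32 == 0) {
        warp_val[threadIdx.x / 32] = best_val;
        warp_idx[threadIdx.x / 32] = best_idx;
    }
    __syncthreads();

    // 4. One thread reduces the 16 per-warp winners (cheap at this size).
    if (threadIdx.x == 0) {
        int best = 0;
        for (int w = 1; w < 16; ++w)
            if (warp_val[w] > warp_val[best]) best = w;
        *out_idx = warp_idx[best];
    }
}
```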
Everyone who's betting their competency on the generosity of billionaires selling tokens at 1/10th-1/20th of their cost, or on a delusional future where capable OS models fit on consumer-grade hardware, is actually cooked.
Of course there will always be larger flagship models, but if you can count on decent on-device inference, it materially changes what you can build.
You can argue whether the projection is too optimistic or not, but this project definitely made me a little bit optimistic on that end.
48 GB is enough for a capable LLM.
Doing that on consumer grade hardware is entirely possible. The bottleneck is CUDA and other intellectual property moats.
The good: it succeeded at discovery, applying edits, and writing a test for a small task I gave it. The bad: it could not address a small nitpick I had. The ugly: it hallucinated a conversation about "The Duck" that I had with it simultaneously while trying to solve another problem. I can only imagine it's one of the examples in the initial Claude Code prompt:
--cut-- However, the user's query is "Can you track these 3 videos here?" which seems unrelated. Perhaps the user is asking if I can track the progress of three videos they are working on?
Let me re-read the user's message. The user said "Source Code" and "The Agent" and "The Duck", it could be video titles. And they are asking if I can track these 3 videos.
?? That doesn't make sense in the context. Could there be two different conversations? --cut--
Edit: The caching story makes a lot more sense for regular usage: > Claude Code may send a large initial prompt, often around 25k tokens, before it starts doing useful work. Keep --kv-disk-dir enabled: after the first expensive prefill, the disk KV cache lets later continuations or restarted sessions reuse the saved prefix instead of processing the whole prompt again.
Also, can the engine support transparent mmap use for fetching weights from disk on-demand, at least when using pure CPU? (GPU inference might be harder, since it's not clear how page faults would interact with running a shader.)
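For the pure-CPU case, the usual shape is just a read-only mapping, so weight pages fault in from disk on first touch instead of being read up front. A minimal sketch, assuming a hypothetical raw `model.weights` file of float32 data (POSIX only):

```cuda
// Host-side code only; nothing GPU-specific here.
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("model.weights", O_RDONLY);   // hypothetical weights file
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    // PROT_READ + MAP_PRIVATE: the OS pages weights in on demand and may
    // evict clean pages under pressure, so a model larger than RAM still runs.
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    const float* weights = static_cast<const float*>(p);

    // Hint the readahead heuristics for a mostly-sequential first pass.
    madvise(p, st.st_size, MADV_SEQUENTIAL);

    printf("first weight: %f\n", weights[0]);   // first page faults in here
    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```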
If the latter test is successful, the next step would be testing Macs with more limited RAM: first running simple requests (which would be quite slow), then larger batches (which might be more worthwhile if one can partially amortize the cost of fetching weights from storage and end up bottlenecked by other factors).
This is probably far from the raw intelligence provided by cloud providers.
Still, this shines more light on local LLMs for agentic workflows.
Are there any architectures that don't rely on feeding the entire history back into the chat?
Recurrent LLMs?
They’ve dropped all the Mac Studio configs higher than 96 GB, as well as the base Mac mini. They’re also rumored to be considering taking the Neo base config off the market.
This seems to be how they’re dealing with supply constraints for fab capacity and RAM.
Maybe Apple would rather not price it at all than experience blowback for either gouging or lack of inventory.
Not really. That's going to land you somewhere in the 0.2-0.5 tokens per second range.
Lovely as modern NVMe drives are, they're not memory.
Nonetheless, I eventually want to build an at-home system. I imagine some smaller local model could handle metadata assignment quite well.
edit: Though TIL Mac Studio doesn't offer 512 GB anymore... DRAM shortage lol. Rough.
https://artificialanalysis.ai/models?models=gpt-5-5%2Cgpt-5-...
I'm assuming this is faster, and/or lets you run a bigger, smarter model, than just using the generic toolchain, but it doesn't spell out the measured improvement over that baseline, or the expected improvement, as far as I can see?
Presumably you can work it out based on the numbers given if you have the relevant comparison values.
This is also a fine example of a vibe-coded project with purpose, as you acknowledged.
I know this is flash, but….
But other than this guy, did our whole society seriously never flamegraph this stuff before we started requesting nuclear reactors colocated at data centers and, like, more than 10% of GDP?
Someone needs to answer, because this isn't even an M4 or M5… WHAT THE FUCK
Sidenote: shout out to antirez, love my Redis :)
That said, I've found that most corporate environments are unintentionally hostile to this kind of optimization work. It's hard to justify until the work is already done. That means you often need people with the skills, means, and motivation to do this who sit outside normal corporate constraints. There aren't many of those.
But you’re right, I agree.
In the corporate world, sadly, performance profiling isn’t treated as a first-class citizen.
Granted, optimization without requirements may not be beneficial, but profiling itself seems worthwhile if you have real use cases.
A lot of us have been working in the network-packet-pushing, distributed-systems, and distributed-storage space.
I’m happy to see more stuff like this :)
TL;DR: I’ve not seen a lot of end-to-end flamegraphs of LLM inference… has anyone else?