Even though the project was meant to be educational, it gave me an idea I can't get out of my head: what if we started building ultra-optimized inference engines tailored to an exact GPU+model combination? GPUs are expensive and harder to get by the day. If you remove enough abstractions and code directly against the exact hardware/model pair, you can probably optimize things quite a lot (I hope). Maybe run an agent that tries to optimize inference in a loop (like autoresearch), empirically testing speed/quality.
The only problem with this is that once a model becomes outdated, you have to do it all again from scratch.
The inference engines in use already include different backend building blocks optimized for different hardware.
While there are places where you can pick up some low-hanging fruit for less popular platforms, there isn't a lot of room to squeeze in super-optimized model runners for specific GPU families and get much better performance. The core computations are already done by highly optimized kernels for each GPU.
There are forks of llama.cpp that have better optimizations for running on CPU architectures, but (barring maintainer disagreements) a better use of time is to target merging these improvements upstream instead of trying to make super specific model+GPU runners.
I have an older W7900 (RDNA3) which, besides 48 GB of VRAM, has some pretty decent roofline specs (123 FP16 TFLOPS / INT8 TOPS, 864 GB/s MBW), but has had notoriously bad support both from AMD (ROCm) and from llama.cpp.
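Back-of-envelope on what those roofline numbers imply (my arithmetic, not anything from the spec sheet):

$$I^{*} = \frac{123 \times 10^{12}\ \text{ops/s}}{864 \times 10^{9}\ \text{B/s}} \approx 142\ \text{ops/byte}$$

A batch-1 decode GEMV at INT8 is on the order of 2 ops per weight byte, far below that knee, so single-stream decode on this card is memory-bandwidth bound regardless of how good the compute kernels are.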
Recently I decided I'd like to turn the card into a dedicated agentic/coder endpoint, so I started tuning a W8A8-INT8 model. After a few days of autolooping (about 800 iterations using a variety of frontier/SOTA models; Kimi K2.6 did surprisingly well), I ended up with prefill +20% and decode +50% faster than the best llama.cpp numbers for Qwen3.6 MoE.
I'm currently grinding MTP and DFlash optimization on it, but I've been pretty pleased with the results, and will probably try Gemma 4 next.
My effort is called shady-thinker and is on GitHub at github.com/tmzt/shady-thinker.
This was inspired in part by Antirez's earlier work with C kernels as well as other efforts to support in-browser LLMs. I've adapted them to Rust and the wgpu library.
Gemma 4 is also the next likely target (with the MTP work) as I'm experimenting with local AI agents.
I'd love to see what you've done to improve prefill and decode, even if it's not directly applicable.
One difference: I'm using MLX and GPTQ 4-bit quants (including AutoRound) with safetensors, since my shader pipeline is pretty much fixed for each model; ggml just adds unnecessary complexity.
Even if it's not perfect, if you publish on GH or HF, some other agent can maybe start there instead of from zero. I did this for Ling-2.6-flash (107B-A7B4 MoE), the biggest LLM I can run for practical use on the other hardware I have for local LLMs (an M2 Max). Even if MTP isn't working well, it's still an improvement over current llama.cpp, which doesn't run Ling-2.6-flash at all. See https://huggingface.co/inclusionAI/Ling-2.6-flash/discussion.... The 4-bit quants are at https://huggingface.co/ljupco/Ling-2.6-flash-GGUF, and the branch is at https://github.com/ljubomirj/llama.cpp/tree/LJ-Ling-2.6-flas....
I think llama.cpp could have done a much better job supporting PCs. Sure, some of it is due to bad vendor support, but with so many users I'm surprised we don't see more optimized inference on standard PCs.
### Diagnosing parallelism pathologies (L1)
*Grid occupancy* (a quick checker sketch follows this list):
- Is `Grid_Size / Workgroup_Size >= CU count` (W7900 = 96, Strix Halo = 40)?
- < 0.3 = massively undersubscribed. Fix the grid FIRST; micro-optimization will NOT help.
- 0.3-1.0 = partially utilized; depends on VGPR/LDS pressure.
- 1.0-4.0 = healthy; micro-optimization can help.
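A tiny host-side checker for the heuristic above. The launch numbers are made up; substitute `grid_size`, `workgroup_size`, and `cu_count` from your actual dispatch.

```cuda
#include <cstdio>

int main() {
    // Hypothetical launch parameters; pull the real values from the dispatch.
    const int grid_size      = 4096;  // total threads launched
    const int workgroup_size = 512;   // threads per workgroup
    const int cu_count       = 96;    // 96 = W7900, 40 = Strix Halo

    // Workgroups per compute unit, per the occupancy heuristic above.
    const float ratio = (float)(grid_size / workgroup_size) / cu_count;
    if (ratio < 0.3f)
        printf("%.2f: massively undersubscribed -- fix the grid first\n", ratio);
    else if (ratio < 1.0f)
        printf("%.2f: partially utilized -- check VGPR/LDS pressure\n", ratio);
    else  // the list above only calls out 1.0-4.0 explicitly
        printf("%.2f: healthy -- micro-optimization can help\n", ratio);
    return 0;
}
```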
*Within-block distribution:*
- Does the kernel do useful work across all threads, or is there an `if (threadIdx.x == 0)` gate around a serial top-k, reduction, or scan (sketched below)? For c=1 decode, many kernels can't grow the grid, but they can always parallelize inside the block.
- `Scratch_Size > 0` from dynamically-indexed per-thread arrays is a strong secondary signal of the within-block pathology.
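For concreteness, the gated-serial shape that bullet describes looks like this; an illustrative kernel with placeholder names (`scores`, `n`), not any actual production code:

```cuda
// Pathology: 511 of 512 launched threads idle while thread 0 walks all
// n scores serially behind the gate.
__global__ void router_argmax_serial(const float* scores, int n, int* out_idx) {
    if (threadIdx.x == 0) {            // the gate in question
        int best = 0;
        for (int i = 1; i < n; ++i)    // e.g. 2048 serial compares
            if (scores[i] > scores[best]) best = i;
        *out_idx = best;
    }
}
```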
*Router top-k (within-block fix)* (fix pattern sketched below):
- Kernel: `qwen35_router_select_kernel` @ c=1 decode.
- Before: grid=1 (can't help; num_tokens=1), blockDim=512, `if (threadIdx.x == 0)` gated 2048 serial compares. Scratch=144 B from spilled per-thread arrays.
- Fix: warp-shuffle parallel argmax across the whole block + a `__shared__` top_vals buffer eliminating the spill.
- Result: 5.7× kernel speedup, +6.6% on 4K/D4K E2E.
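A minimal sketch of that fix pattern, reconstructed from the description rather than taken from the actual kernel (shown as a single argmax for brevity; the real fix is a top-k): every thread scans a strided slice, warp shuffles fold lanes without spilling, and a small `__shared__` buffer plays the top_vals role.

```cuda
#include <cmath>

// Assumes blockDim.x = 512 (16 warps of 32 lanes) and a single block (grid=1).
__global__ void router_argmax_parallel(const float* scores, int n, int* out_idx) {
    __shared__ float warp_val[16];  // per-warp winners instead of spilled arrays
    __shared__ int   warp_idx[16];

    // 1. Strided scan: all 512 threads do useful work instead of one.
    float best_val = -INFINITY;
    int   best_idx = 0;
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        if (scores[i] > best_val) { best_val = scores[i]; best_idx = i; }

    // 2. Warp-shuffle reduction: fold 32 lanes down to lane 0.
    for (int off = 16; off > 0; off >>= 1) {
        float v = __shfl_down_sync(0xffffffffu, best_val, off);
        int   j = __shfl_down_sync(0xffffffffu, best_idx, off);
        if (v > best_val) { best_val = v; best_idx = j; }
    }

    // 3. Lane 0 of each warp publishes its winner to shared memory.
    if (threadIdx.x % 32 == 0) {
        warp_val[threadIdx.x / 32] = best_val;
        warp_idx[threadIdx.x / 32] = best_idx;
    }
    __syncthreads();

    // 4. One thread reduces the 16 per-warp winners (cheap at this size).
    if (threadIdx.x == 0) {
        int best = 0;
        for (int w = 1; w < 16; ++w)
            if (warp_val[w] > warp_val[best]) best = w;
        *out_idx = warp_idx[best];
    }
}
```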
Everyone who's betting their competency on the generosity of billionaires selling tokens at 1/10th-1/20th of their cost, or on a delusional future where capable OS models fit on consumer-grade hardware, is actually cooked.
Of course there will always be larger flagship models, but if you can count on decent on-device inference, it materially changes what you can build.
You can argue whether the projection is too optimistic or not, but this project definitely made me a little bit optimistic on that end.
48 GB is enough for a capable LLM.
Doing that on consumer grade hardware is entirely possible. The bottleneck is CUDA and other intellectual property moats.
The good: it succeeded at discovery, applying edits, and writing a test for a small task I gave it. The bad: it could not address a small nitpick I had. The ugly: it hallucinated a conversation about "The Duck" that I had with it simultaneously while trying to solve another problem. I can only imagine it's one of the examples in the initial Claude Code prompt:
--cut-- However, the user's query is "Can you track these 3 videos here?" which seems unrelated. Perhaps the user is asking if I can track the progress of three videos they are working on?
Let me re-read the user's message. The user said "Source Code" and "The Agent" and "The Duck", it could be video titles. And they are asking if I can track these 3 videos.
?? That doesn't make sense in the context. Could there be two different conversations? --cut--
Edit: The caching story makes a lot more sense for regular usage: > Claude Code may send a large initial prompt, often around 25k tokens, before it starts doing useful work. Keep --kv-disk-dir enabled: after the first expensive prefill, the disk KV cache lets later continuations or restarted sessions reuse the saved prefix instead of processing the whole prompt again.
Also, can the engine support transparent mmap use for fetching weights from disk on-demand, at least when using pure CPU? (GPU inference might be harder, since it's not clear how page faults would interact with running a shader.)
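For the pure-CPU case, the usual shape is just a read-only mapping, so weight pages fault in from disk on first touch instead of being read up front. A minimal sketch, assuming a hypothetical raw `model.weights` file of float32 data (POSIX only):

```cuda
// Host-side code only; nothing GPU-specific here.
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("model.weights", O_RDONLY);   // hypothetical weights file
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    // PROT_READ + MAP_PRIVATE: the OS pages weights in on demand and may
    // evict clean pages under pressure, so a model larger than RAM still runs.
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    const float* weights = static_cast<const float*>(p);

    // Hint the readahead heuristics for a mostly-sequential first pass.
    madvise(p, st.st_size, MADV_SEQUENTIAL);

    printf("first weight: %f\n", weights[0]);   // first page faults in here
    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```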
If the latter test is successful, the next step would be testing Macs with more limited RAM: first running simple requests (which would be quite slow), then larger batches (which might be more worthwhile if one can partially amortize the cost of fetching weights from storage and end up bottlenecked by other factors).
This is probably far from the raw intelligence provided by cloud providers.
Still, this shines more light on local LLMs for agentic workflows.
Are there any architectures that don't rely on feeding the entire history back into the chat?
Recurrent LLMs?
They’ve dropped all the Mac Studio configs higher than 96 GB, as well as the base Mac mini. They’re also rumored to be considering taking the Neo base config off the market.
This seems to be how they’re dealing with supply constraints for fab capacity and RAM.
Maybe Apple would rather not price it at all than experience blowback for either gouging or lack of inventory.
Not really. That's going to land you somewhere in the 0.2-0.5 tokens per second range.
Lovely as modern NVMe drives are, they're not memory.
Nonetheless, I eventually want to build an at-home system. I imagine some smaller local model could handle metadata assignment quite well.
edit: Though TIL Mac Studio doesn't offer 512 GB anymore... DRAM shortage lol. Rough.
https://artificialanalysis.ai/models?models=gpt-5-5%2Cgpt-5-...
I'm assuming this is faster, and/or lets you run a bigger, smarter model, than just using the generic toolchain, but it doesn't spell out the measured improvement over that baseline, or the expected improvement, as far as I can see?
Presumably you can work it out based on the numbers given if you have the relevant comparison values.
This is also a fine example of a vibe-coded project with purpose, as you acknowledged.
I know this is flash, but….
But other than this guy, did our whole society seriously never flamegraph this stuff before we started requesting nuclear reactors colocated at data centers and, like, more than 10% of GDP?
Someone needs to answer, because this isn't even an M4 or M5… WHAT THE FUCK
Sidenote: shout out to antirez, love my Redis :)
That said, I've found that most corporate environments are unintentionally hostile to this kind of optimization work. It's hard to justify until the work is already done. That means you often need people with the skills, means, and motivation to do this who sit outside normal corporate constraints. There aren't many of those.
But you’re right, I agree.
In the corporate world, sadly, performance profiling isn’t treated as a first-class citizen.
Granted, optimization without requirements may not be beneficial, but profiling itself seems worthwhile if you have real use cases.
A lot of us have been working in the network-packet-pushing, distributed-systems, and distributed-storage space.
I’m happy to see more stuff like this :)
TL;DR: I’ve not seen a lot of end-to-end flamegraphs of LLM inference… has anyone else?