Anyways, so excited for an open-weight model and I hope it performs well. I’ll be testing this ASAP.
I've had pretty good success with LLMs after putting in place metrics to measure true complexity (not cyclomatic), and automatically pushing back everything until the added complexity is within reason for the feature.
llm consortium save cns-glm -m glm-5.2 -n 5 --arbiter mercury-2 --judging-method rank
llm consortium save cns-kimi -m k2.6 -n 5 --arbiter mercury-2 --judging-method rank
llm consortium save cns-meta-glm-kimi -m cns-glm -m cns-kimi --arbiter mercury-2 --judging-method synthesis
Now when I prompt cns-meta-glm-kimi it will pick the best of five from kimi and glm before creating a synthesis from the two winners.
I did some benchmarks recently of how well various models find security vulnerabilities, followed by testing of the judging process: whether the models found the right bug, and whether the other bugs they reported were false positives or legitimate. A committee of good-not-great models (DeepSeek, MiMo, Gemma 4) cannot replicate the accuracy of Opus by itself. Even when all three of the other models disagreed with Opus, Opus was almost always the one that was actually right.
It's an interesting area for research. And a model that's very fast can make a lot more attempts at a solution, and in cases where there is an unambiguous "right" solution that can be proven by some sort of static rule, "very fast" may be a useful characteristic. Small classification problems, where you need to make thousands of decisions about some specific aspect of a large corpus of data, seem like a sweet spot for a model like Mercury.
Claude is better for extremely complicated, large codebases where its slower response time might be a good trade-off for the complexity of the task. Antigravity and other fast models work so much better for smaller projects where you want a "flowy" code, run, debug cycle.
Regardless of speed, use the LLM to eliminate the need for boilerplate rather than just creating more code faster.
> if it's too slow, you just stay in that goddamn async death loop
Things get slow when you're ballooning the size of your code, files, design and architecture, and things get more involved and complicated, piling fast hacks on top of fast hacks until everything gets brittle.
Slow is fast, longer-term anyways.
10-20 seconds times a couple turns on a new feature isn't bad. Kimi is also similarly fast if not faster.
I do agree with smaller models for more constrained/routine tasks though.
Today, even the dumbest AI agents can trivially loop through the final dance to get compilation, and often unit tests (depending on scope of failure). Big SOTA agents have OK code quality, but if left unattended or unchecked will still generate pretty sloppy repos after a while. That’s true even when using models like Opus, which is ridiculously expensive in comparison.
When using the models in this fast “pair programming” style, I find that I (the human) mostly do all the “plan and think” work, and usually point the smaller agent towards specific files/directories, with specific targeted changes. It’s slower than 1-shot prompting an entire feature, but slightly faster than doing it manually, and I find the code is less “slop” because the changes are smaller and more human. It retains the agentic benefits of handling imports, compilation iteration, etc. and can do basic cross-file plumbing. It’s also cheap and fast to do iterations like “wait make that method static” or “let’s update this to use <other util class>” and things like that. When the agent is slow to make localized edits, I find I’m less likely to push for minor nit-picks and style updates.
An LLM's decoder computes tokens one at a time because attention has to account for each previous token. The existing LLM decoders scale well when you have enough load to batch many inferences together. Diffusion is of limited benefit there. On edge you have a different problem: your inference accelerator is starved while sloshing GB of weights back and forth from RAM. That's because consumer RAM like LPDDRx/GDDRx is lower bandwidth than HBM, and the requests are serial so you can't batch compute over common weights. Diffusion can compute tokens in parallel, which relieves the memory bandwidth bottleneck.
You can use diffusion with attention, and this model does in fact use attention.
All these efficiency improvements seem likely to be really important to the future of AI, though, as the money starts flowing the other direction. The days of subsidized tokens to try to lock people into specific ecosystems are coming to an end, and we're going to have to start paying what it actually costs.
The companies that figure out how to make it cost-effective to run really smart models are the ones that will win. DeepSeek costs an order of magnitude less than GPT 5.5 or Opus 4.8. It's worse than either, but not catastrophically worse. I'll happily pay ten times as much for the best coding model, because it saves enough human time to justify it, but not a hundred times as much, which is where things seem to be heading (GPT 5.5 Pro cost over 200 times as much as DeepSeek in some benchmarks I recently did, and ~30 times as much as Opus 4.8).
Qwen 3.6 is maybe better for code (though I'm beginning to think otherwise after some benchmarking I've been doing, where Gemma 4 has been overperforming expectations), but for just about anything else, Gemma 4 is the one.
If they're gimping it, why is nobody else making a better one that small?
The nice thing about DeepSeek is its ability to be run on local hardware, with no API costs involved. If you care deeply about that, then it being a bit worse than Opus or GPT isn't really a problem.
After the AI bubble bursts, it will be the likes of Google that come out the other side still wearing their shirts. I think this bubble is out to scalp some giants.
The thing is, diffusion models perform somewhat worse than autoregressive on text. So you lose some performance.
Speed is the big advantage. Autoregressive when doing local inference is mostly memory bound; you're doing one token at a time, for each token you need to load all weights. MTP helps a bit by allowing you to draft tokens in a smaller model and then verify them in parallel with the larger model, allowing you to do a few computations for every memory load, but because you're still doing tokens sequentially and need to discard invalid drafted tokens, you can only get so much speedup.
For hosted models, however, you can batch many token generations together, fully utilizing all of the compute while no longer being bottlenecked on memory bandwidth. So they are already operating at close to max efficiency.
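The memory-bound versus compute-bound argument above can be put into a back-of-the-envelope estimate. This is a rough sketch, not a benchmark: the function name and the hardware figures in the usage note are made up for illustration, and it ignores the KV cache, activation traffic, and the fact that a MoE batch may touch different experts per sequence.

```python
def decode_tokens_per_second(active_params_b, bytes_per_param,
                             mem_bw_gbs, compute_tflops, batch_size):
    """Rough upper bound on autoregressive decode throughput.

    Assumption: each decode step streams the active weights from memory once
    (shared by the whole batch) and spends about 2 FLOPs per active parameter
    for every sequence in the batch.
    """
    weight_bytes = active_params_b * 1e9 * bytes_per_param
    t_memory = weight_bytes / (mem_bw_gbs * 1e9)
    t_compute = 2 * active_params_b * 1e9 * batch_size / (compute_tflops * 1e12)
    # The slower of the two resources sets the step time.
    step_time = max(t_memory, t_compute)
    # One token per sequence per step.
    return batch_size / step_time


if __name__ == "__main__":
    # Hypothetical hardware: 4B active params at 1 byte each,
    # 500 GB/s memory bandwidth, 100 TFLOPS of usable compute.
    for batch in (1, 8, 64, 100, 200):
        tps = decode_tokens_per_second(4, 1, 500, 100, batch)
        print(f"batch={batch:>3}  ~{tps:,.0f} tokens/s total")
```

With these invented numbers a single user is limited by memory bandwidth to roughly 125 tokens/s, while total throughput keeps growing with batch size until compute becomes the limit at around batch 100. That gap is the headroom that parallel decoding can use locally and that batching already uses on a hosted service.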
So, diffusion kind of loses its benefit in hosted models. Sure, maybe you could pay more to have slightly lower latency responses by doing diffusion for one user at a time instead of autoregressive for many in parallel. But given that it also reduces accuracy, it's hard to see where you'd really want that. Unless they're able to bring it up to par with autoregressive, it seems like it's a bit of a dead end outside of local models where you're generally just doing one thing at a time.
My immediate thought - this performs slightly worse than the autoregressive gemma equivalent, but it may also let me functionally run better models in diffusion variants.
Ex - I can run 70b-120b autoregressive models locally right now, but I get ~5-15t/s, which just isn't fast enough for serious work.
Which caps me down in the 20-36b models (ex - gemma4) where I can get 100+t/s on the same hardware.
So the question becomes - does the quality drop from a diffusion model outweigh the quality bump from using a larger model?
Because if not... sounds like diffusion models have a lot of space to thrive.
---
Sadly - if they can't be hosted profitably, I question whether this space will actually be explored.
(I got it to draw a pelican: https://tools.simonwillison.net/markdown-svg-renderer#url=ht... )
And instead of reporting tps, you would - of course! - report pfps (pelican frames per second).
...the result was quite surprising!
"DiffusionGemma's speedup is designed for local and low-concurrency inference. In high-QPS cloud serving, autoregressive models can be deployed to saturate compute efficiently, so DiffusionGemma's parallel decoding offers diminishing returns and can result in higher serving costs"
So the diffusion process takes more GFLOPs, and if you have enough users you can already balance memory and compute with autoregressive decoding.
I ran this question by ChatGPT and Claude and they came up with limitations in GRPO RLVR, but I'm not sure..
How exactly does this work with CoT?
What didn't work reliably was write tool calls specifically, and the pull request does not resolve this. As far as I understand, though, the root issue is not the inference framework but that DiffusionGemma emits incorrect JSON.
When `content` contains `, ` inside a string value, the decoder splits there and emits the remainder as a nonsensical JSON key. So `{"path": "f.py", "content": "def f(x, y):\n return x"}` becomes `{"path": "f.py", "content": "def f(x", "y):\n return x": ...}`.
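A cheap way to catch this failure mode before a write reaches disk is to check the parsed arguments against the keys the tool expects. The sketch below is an assumption-heavy illustration: the `path`/`content` schema is taken from the example above, `check_write_call` is a hypothetical helper rather than part of any inference framework, and the stray key's value (elided in the example) is replaced by a `null` placeholder.

```python
import json

# Hypothetical schema for the write tool, inferred from the example above.
EXPECTED_KEYS = {"path", "content"}


def check_write_call(raw):
    """Return (ok, reason) for a raw JSON tool-call argument string."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(args, dict):
        return False, "arguments are not a JSON object"
    extra = set(args) - EXPECTED_KEYS
    missing = EXPECTED_KEYS - set(args)
    if extra or missing:
        return False, f"unexpected keys {sorted(extra)}, missing keys {sorted(missing)}"
    return True, "ok"


good = '{"path": "f.py", "content": "def f(x, y):\\n return x"}'
# The value of the stray key is elided in the report; null is a placeholder.
bad = '{"path": "f.py", "content": "def f(x", "y):\\n return x": null}'

if __name__ == "__main__":
    print(check_write_call(good))
    print(check_write_call(bad))
```

The split output is still syntactically valid JSON, so a parse check alone would not catch it; only the key check does. A rejected call can then be retried or reported instead of silently writing a truncated file.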
I wondered if the JSON issue might be related to quantization and tested the BF16 variant of google/diffusiongemma-26b-a4b-it via NVIDIA NIM. The model did not show the delimiter-splitting bug. It did, however, have a quote-handling issue. Among other things, it duplicated triple quotes (`"""..."""` becomes `""""""...""""""`).
> Operating as a 26B total Mixture of Experts (MoE) model that activates only 3.8B parameters during inference, DiffusionGemma fits comfortably within 18GB VRAM limits of high-end dedicated consumer GPUs when quantized.
Okay, so Gemma 4 26B is a MoE model that's really fast on my 24 GB GPU using ollama. This sounds like speculative decoding but I don't think that works with MoE models? It's hard to keep up with all this when it's not your job to keep up with it.
The mechanism isn't the same as speculative decoding. Speculative decoding happens sequentially and (usually) a couple of tokens at a time; diffusion doesn't, and does blocks of text at once. I haven't read the collateral yet but my assumption would be that it's trained to keep the specific experts stable across a diffusion block.
Does not need opus level to write, and easy to iterate on.
Then you will be able to benefit from Jevons paradox and enjoy the same “productivity gains” without paying the extortionate token prices charged by closed model providers, or at least get them as cheaply as possible.
And especially, no silent nerfing of the model.
Next year, and the year after, Fable, GPT 5.5 and Gemini 3.5 will feel quite ordinary. And perhaps even within reach of a prosumer running models locally.
E.g. run your normal autoregressive LLMs (with MTP whatever, as you like), then run a single diffusion pass over the result, and observe any tokens that diffusion thinks are unlikely.
Then prompt the autoregressive llm with some structured reasoning "<think>Is <diffusion unlikely part> an error? .."
Because the diffusion model is so structurally different perhaps it makes different errors such that this would provide gains even vs running distinct autoregressive LLMs which often make the same errors.
The same argument could apply for RWKV but it would be relatively expensive to apply it as a second pass on a big block of output, while it seems like a diffusion model would be cheaper.
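The second-pass idea above can be sketched without committing to any particular model API. In this illustration the per-token probabilities are assumed to come from a diffusion (or other) model scoring the finished draft; how to obtain them is left open, and the function names and the 0.05 threshold are hypothetical.

```python
def flag_unlikely_spans(tokens, second_pass_probs, threshold=0.05):
    """Group consecutive tokens that the second-pass model rates as unlikely.

    tokens: the autoregressive model's output, one string per token.
    second_pass_probs: probability the second model assigns to each token
        (assumed input; obtaining it depends on the model and runtime).
    """
    spans, current = [], []
    for token, prob in zip(tokens, second_pass_probs):
        if prob < threshold:
            current.append(token)
        elif current:
            spans.append("".join(current))
            current = []
    if current:
        spans.append("".join(current))
    return spans


def build_review_prompt(draft, spans):
    """Ask the original model to reconsider each flagged span."""
    questions = "\n".join(f"Is {span!r} an error?" for span in spans)
    return f"{draft}\n<think>\n{questions}\n"


if __name__ == "__main__":
    tokens = ["def", " f", "(", "x", ",", " z", ")"]
    probs = [0.9, 0.8, 0.9, 0.7, 0.9, 0.01, 0.6]
    flagged = flag_unlikely_spans(tokens, probs)
    print(build_review_prompt("".join(tokens), flagged))
```

Grouping adjacent low-probability tokens keeps the follow-up prompt short, which matters if the point is for the second pass to be cheaper than running another full model.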
The video demo of the svg sword is a good example of what is so interesting about diffusion models: it's not just putting one token after another to make edits to a file. It's skipping around, it's re-editing previous lines. I feel like forcing it to write tool calls is maybe not its best nature.
I feel like perhaps instead of a monolithic edit file tool call, perhaps the diffusion model would be better suited to posting a change stream, a series of edit ops, across multiple files.
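To make the change-stream idea concrete, here is one possible shape for such a stream. It is purely illustrative: the `EditOp` fields and the `apply_ops` helper are hypothetical and not something DiffusionGemma or any existing tool-calling format defines.

```python
from dataclasses import dataclass


@dataclass
class EditOp:
    """One small edit in a stream: replace lines [start, end) of a file."""
    path: str
    start: int   # index of the first line to replace, 0-based
    end: int     # one past the last line to replace; start == end inserts
    lines: list  # replacement lines


def apply_ops(files, ops):
    """Apply edit ops in order to a dict mapping path -> list of lines."""
    for op in ops:
        buffer = files.setdefault(op.path, [])
        buffer[op.start:op.end] = op.lines
    return files


if __name__ == "__main__":
    files = {"a.py": ["x = 1", "y = 2"]}
    stream = [
        EditOp("a.py", 1, 2, ["y = 3"]),     # revise a line
        EditOp("b.py", 0, 0, ["import a"]),  # touch a second file
        EditOp("a.py", 0, 0, ["# header"]),  # jump back to the first file
    ]
    print(apply_ops(files, stream))
```

Because each op is small and self-contained, the model can jump between files and revisit earlier lines, which is closer to the skipping-around behaviour seen in the demo than one monolithic write.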
Compared to autoregressive decoding, diffusion is huge for local MoE inference because of the improved token generation efficiency, especially in the usual GPU + RAM offload setup.
However, there are models which are better positioned on the performance vs memory pareto front, i.e. dense models, so I'll just wait.
P.S. QAT is really something as it reduces the performance fluctuations compared to the normal one. Thanks again.
The bidirectionality could be a big deal: being able to refine a sentence with both left and right context feels closer to how editing/thinking actually works than committing to each token forever.
Maybe the current models aren’t good enough yet, but the direction feels important.