DiffusionGemma: 4x Faster Text Generation

(blog.google)

327 points · meetpateltech · 15d ago · 88 comments

78 comments · 27 top-level

vineyardmike15d ago· 16 in thread

Recently I had switched to OpenCode to try out many of the Non-US-Frontier-Labs models. My unexpected favorite model to use was Mercury (a diffusion model). Not because it was “smart” but because it was stupid fast. It was more of a pair-programming experience instead of the SOTA agentic experience of prompting and waiting. Honestly, it was also way more fun and brought back some of the pre-AI coding experience while still getting some benefits of AI. It felt less of a slot machine where you prompt, wait, and hope it went in the right direction. It made me even use the tiny models like Gemini Flash Lite and GPT Mini/Nano more too.

Anyways, so excited for an open-weight model and I hope it performs well. I’ll be testing this ASAP.

onlyrealcuzzo15d ago

If you can run your tests fast and cheaply, and have metrics that show what bad/sloppy code is that are cheap & fast to generate, a worse fast model can outperform a far better far slower model if you value time...

I've had pretty good success with LLMs after putting in place metrics to measure true complexity (not cyclomatic), and automatically pushing back everything until the added complexity is within reason for the feature.

bee_rider15d ago

How do you measure “true” complexity? Cyclomatic seems a bit… I dunno, artificial? Blunt? But it has the benefit of being defined.

fridder14d ago

I wonder if a dedicated client or mode in a client would provide some benefits. Might also be interesting to do adversarial stuff too where it argues with itself or another model

Daishiman15d ago

What metrics have you found useful?

yeodev15d ago

I wonder how much this will impact locally used models for coding. I can imagine using diffusion models that are x-times faster than Qwen or Gemma 4 - where I have to do more "pre-ai" work which is a good thing and can have a very fast, very cheap model to work with locally. I assume since it doesn't do heavy computing for a long time that it's cheaper to run on local hardware as well?

irthomasthomas15d ago

Mercury-2 is amazing. I am using it frequently as the arbiter in llm-consortium. The context window is relatively small, so to make it work with larger consortiums I can construct a recursive sort-of meta consortium like this:

  llm consortium save cns-glm -m glm-5.2 -n 5 --arbiter mercury-2 --judging-method rank

  llm consortium save cns-kimi -m k2.6 -n 5 --arbiter mercury-2 --judging-method rank

  llm consortium save cns-meta-glm-kimi -m cns-glm -m cns-kimi --arbiter mercury-2 --judging-method synthesis

Now when I prompt cns-meta-glm-kimi it will pick the best of five from kimi and glm before creating a synthesis from the two winners.

SwellJoe15d ago

I've found the average output of many suboptimal models is still suboptimal, especially when it comes to judging the accuracy/correctness of the work of other models.

I did some benchmarks recently of how well various models find security vulnerabilities, and then follow up testing of the judging process of whether the models found the right bug and whether other bugs it reported were false positives or legitimate other bugs. A committee of good-not-great models (DeepSeek, MiMo, Gemma 4) cannot replicate the accuracy of Opus by itself. Even when all three of the other models disagreed with Opus, Opus was almost always the one that was actually right.

It's an interesting area for research. And, a model that's very fast can make a lot more attempts at a solution, and in cases where there is an unambiguous "right" solution that can be proven by some sort of static rule, "very fast" may be a useful characteristic. Small classification problems, where you need to make thousands of decisions about some specific aspect of a large corpus of data, seem like a sweet spot for a model like Mercury.

evilturnip15d ago

I get exactly what you mean. After getting frustrated with how slow Claude was on my personal projects, I switched to Google Antigravity with Flash models and the speed difference is huge. I feel more in the flow and just more focused on the task. I did not realize how much a difference speed can make.

Claude is better for extremely complicated, large codebases where its slower response time might be a good trade-off for the complexity of the task. Antigravity and other fast models work so much better for smaller projects where you want a "flowy" code, run, debug cycle.

bpavuk15d ago

YESSSS!!! speed is THE way! I like my boilerplate POJOs/data classes generated at breakneck pace of 300+ tok/s, Flash-Lite is more useful than GPT-5.5 for me this way. if it's too slow, you just stay in that goddamn async death loop

embedding-shape15d ago

> I like my boilerplate POJOs/data classes generated at breakneck pace of 300+ tok/s

Regardless of speed, use the LLM to eliminate the need for boilerplate rather than just creating more code faster.

> if it's too slow, you just stay in that goddamn async death loop

Things get slow when you're ballooning the size of your code, files, design and architecture, and things get more involved and complicated, piling fast hacks on top of fast hacks until everything gets brittle.

Slow is fast, longer-term anyways.

elxr15d ago

For boilerplate, yeah. But when asking research or exploratory questions, or weighing whether a feature is well designed, or asking "can I implement _x_ feature using these libraries without introducing unnecessary complexity", then GPT-5.5 medium is still fast enough.

10-20 seconds times a couple turns on a new feature isn't bad. Kimi is also similarly fast if not faster.

I do agree with smaller models for more constrained/routine tasks though.

fittingopposite14d ago

Mercury is a US LLM from https://www.inceptionlabs.ai/

desireco4215d ago

Wow... I forgot about that. Mercury is brutal. I had him review lint errors and the speed is just insane

skybrian15d ago

Could you say more about how you use it? What does your workflow look like?

vineyardmike15d ago

Imagine you’re entirely pre-AI… to do some work, you read code, think, then write some code across a number of files. Usually then a small dance with compilation/unit tests to address anything broken. Along the way, you use your human judgement on style and quality, and midway through your change you might refactor something based on learned best practices (eg, when to use a static method, or helper class).

Today, even the dumbest AI agents can trivially loop through the final dance to get compilation, and often unit tests (depending on scope of failure). Big SOTA agents have OK code quality, but if left unattended or unchecked will still generate pretty sloppy repos after a while. That’s true even when using models like Opus which is ridiculously expensive in comparison.

When using the models in this fast “pair programming” style, I find that I (the human) mostly do all the “plan and think” work, and usually point the smaller agent towards specific files/directories, with specific targeted changes. It’s slower than 1-shot prompting an entire feature, but slightly faster than doing it manually, and I find the code is less “slop” because the changes are smaller and more human. It retains the agentic benefits of handling imports, compilation iteration, etc. and can do basic cross-file plumbing. It’s also cheap and fast to do iterations like “wait make that method static” or “let’s update this to use <other util class>” and things like that. When the agent is slow to make localized edits, I find I’m less likely to push for minor nit-picks and style updates.

andai15d ago

So you're making smaller edits?

samuelknight15d ago· 4 in thread

Some of these comments miss the advantage of diffusion. This will have a big impact on edge devices, such as your phone or the GPU in your computer.

An LLM's decoder computes tokens one at a time because attention has to account for each previous token. Existing LLM decoders scale well when you have enough load to batch many inferences together, so diffusion is of limited benefit there. On edge you have a different problem: your inference accelerator is starved while sloshing GB of weights back and forth from RAM. That's because consumer RAM like LPDDRx/GDDRx is lower bandwidth than HBM, and the requests are serial so you can't batch compute over common weights. Diffusion can compute tokens in parallel, which relieves the memory bandwidth bottleneck.

zozbot23415d ago

Edge devices don't just have limited memory bandwidth though, they also have very limited compute. To the extent where you don't actually need all that much batching to saturate their viable compute and run into obvious thermal/power limits. (It's just not true that "requests are inherently serial" in edge inference; any time you have multiple requests (i.e. "chats") in flight, batching becomes applicable if you have enough memory capacity for the KV caches.) I'm not sure how diffusion models are supposed to help there, if they simply take more compute for lower-quality outcomes and a dubious saving in memory bandwidth.

zozbot23414d ago

Forgot to mention it previously, but this might be a good model for a narrow slice of midrange systems that really are more skewed towards compute than memory bandwidth, but also don't have enough memory capacity to effectively use batching. (E.g. top-of-the-range consumer GPUs, or earlier generations of datacenter GPUs.) Although you do also compete with things like MTP there, which is targeting a similar tradeoff, or with denser models featuring a similar amount of total parameters. So I'd say that the jury is very much still out, even in that narrow space. Diffusion models are also apparently very hard to scale to a hundred-billion or trillion parameter count, since the way you train them is completely different to the usual one-token-at-a-time models.

BarakWidawsky15d ago

You’re mostly right but conflating attention with autoregressive/causal which is the real issue that prevents you from using more compute

You can use diffusion with attention, and this model does in fact use attention

samuelknight15d ago

Yes, I should have said autoregressive.

SwellJoe15d ago· 4 in thread

Google keeps flexin'. It's surprising that Gemini isn't more competitive against Claude or OpenAI models for code and agentic use, because it's clear Google still has some of the best AI people in the business. But, I guess Google is focused on stuff that runs on phones and near-realtime use cases, rather than the big thinky LLMs.

All these efficiency improvements seem likely to be really important to the future of AI, though, as the money starts flowing the other direction. The days of subsidized tokens to try to lock people into specific ecosystems are coming to an end, and we're going to have to start paying what it actually costs.

The companies that figure out how to make it cost-effective to run really smart models are the ones that will win. DeepSeek costs an order of magnitude less than GPT 5.5 or Opus 4.8. It's worse than either, but not catastrophically worse. I'll happily pay ten times as much for the best coding model, because it saves enough human time to justify it, but not a hundred times as much, which is where things seem to be heading (GPT 5.5 Pro cost over 200 times as much as DeepSeek in some benchmarks I recently did, and ~30 times as much as Opus 4.8).

halJordan14d ago

Google is clearly gimping the gemma models. There is a 122b gemma 4 that was never released, but was a part of the announcement tweet. Plus they weren't going to release MTP until people figured out they're running it on the pixels

SwellJoe14d ago

I dunno about that. Gemma 4 is probably the best model for general self-hosted use for almost everyone that doesn't have a data center in their basement. They didn't have to release it at all, and they didn't have to release speculative decoding drafters, and they didn't have to release the QAT version of the models that makes the 4-bit quantization perform very close to the bigger versions, and can run in 32GB. I'd love a 122B version of it, and I didn't realize they'd ever announced one was coming (though I remember there being speculation about it). But, also, I'm happy they're doing so much with it. They've got all the sizes covered, it has great prose for an LLM, better prose than even most larger models, it's got great audio and vision, and broad language support. As self-hosted general purpose models go, it's the total package.

Qwen 3.6 is maybe better for code (though I'm beginning to think otherwise after some benchmarking I've been doing, where Gemma 4 has been overperforming expectations), but for just about anything else, Gemma 4 is the one.

If they're gimping it, why is nobody else making a better one that small?

zozbot23415d ago

Fable's costs are twice Opus' and it's clearly quite competitive with GPT-Pro, so that seems like it might be a good option for you if the trigger-happy safeguards aren't too much of a problem. Google has their own "Deep Research" option in this space which seems to work well.

The nice thing about DeepSeek is its ability to be run on local hardware, with no API costs involved. If you care deeply about that, then it being a bit worse than Opus or GPT isn't really a problem.

bArray15d ago

I think Google will win out in the end. They are concentrating on what matters, performance per watt, and performance per dollar. They are building their own inference hardware and are working towards edge-computing which removes latency and compute overheads. These big LLMs are not yet cost effective, Google is just letting them burn their investment funds to "sell" to consumers at below cost.

After the AI bubble bursts, it will be the likes of Google that come out the other side still wearing their shirts. I think this bubble is out to scalp some giants.

kkukshtel15d ago· 4 in thread

I think this is the future. The sort of left-field rumble that turns into a quake in 5 years.

famouswaffles15d ago

Almost certainly not if things remain as they are. The reason there's been little traction is that the quality gap between diffusion and autoregressive models is pretty stark. I mean just look at the benchmarks here. Large dropoffs, with the hardest benchmarks seeing the largest drops. On top of that, almost all the speed benefits of diffusion models become negated at scale. So this is only attractive for local model development, and almost everyone training local models still cares about pound-for-pound quality and inference efficiency at scale.

regularfry15d ago

It's fast enough that "ask it twice and pick the best" should still come out ahead performance-wise. I don't know how much that would close the quality gap by, but it's worth a play.
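A minimal sketch of the "ask it twice and pick the best" idea, assuming you have some cheap way to score a candidate. `generate` and `score` are hypothetical stand-ins, not part of any real API; the toy versions exist only to make the sketch runnable.

```python
from typing import Callable, List


def best_of_n(prompt: str,
              generate: Callable[[str, int], str],
              score: Callable[[str], float],
              n: int = 2) -> str:
    """Sample n candidates (one per seed) and return the highest-scoring one."""
    candidates: List[str] = [generate(prompt, seed) for seed in range(n)]
    return max(candidates, key=score)


# Toy stand-ins (hypothetical). A real setup would call the model in
# generate() and run e.g. a test suite or linter inside score().
def toy_generate(prompt: str, seed: int) -> str:
    return f"{prompt} -> draft {seed}" + "!" * seed


def toy_score(text: str) -> float:
    return float(len(text))


if __name__ == "__main__":
    print(best_of_n("fix the bug", toy_generate, toy_score, n=2))
```

Whether this closes the quality gap depends entirely on how good the scorer is; with a weak scorer the extra samples buy very little.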

lambda15d ago

This may be the future of local models.

The thing is, diffusion models perform somewhat worse than autoregressive on text. So you lose some performance.

Speed is the big advantage. Autoregressive when doing local inference is mostly memory bound; you're doing one token at a time, for each token you need to load all weights. MTP helps a bit by allowing you to draft tokens in a smaller model and then verify them in parallel with the larger model, allowing you to do a few computations for every memory load, but because you're still doing tokens sequentially and need to discard invalid drafted tokens, you can only get so much speedup.

For hosted models, however, you can batch many token generations together, fully utilizing all of the compute while no longer being bottlenecked on memory bandwidth. So they are already operating at close to max efficiency.

So, diffusion kind of loses its benefit in hosted models. Sure, maybe you could pay more to have slightly lower latency responses by doing diffusion for one user at a time instead of autoregressive for many in parallel. But given that it also reduces accuracy, it's hard to see where you'd really want that. Unless they're able to bring it up to par with autoregressive, it seems like it's a bit of a dead end outside of local models where you're generally just doing one thing at a time.
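The memory-bound argument can be put into a back-of-envelope estimate. This is only a sketch: it assumes decoding is limited purely by memory bandwidth and ignores compute limits, KV cache traffic, and MoE routing. The bandwidth, weight size, and step count in the example are made-up illustrative numbers, not measurements of any model.

```python
def ar_tokens_per_s(bandwidth_gb_s: float, weights_read_gb: float) -> float:
    """Upper bound for single-stream autoregressive decoding:
    every token needs one full pass over the weights it uses."""
    return bandwidth_gb_s / weights_read_gb


def block_tokens_per_s(bandwidth_gb_s: float, weights_read_gb: float,
                       block_tokens: int, steps: int) -> float:
    """Same memory-bound estimate for block decoding: each denoising step
    reads the weights once, and the whole block is produced after `steps`."""
    seconds_per_block = steps * weights_read_gb / bandwidth_gb_s
    return block_tokens / seconds_per_block


if __name__ == "__main__":
    # Illustrative assumptions only: 1000 GB/s, 10 GB read per pass,
    # a 256-token block refined over 32 steps.
    ar = ar_tokens_per_s(1000.0, 10.0)
    block = block_tokens_per_s(1000.0, 10.0, block_tokens=256, steps=32)
    print(f"autoregressive: {ar:.0f} tok/s, block: {block:.0f} tok/s")
```

Under this model the speedup is simply block size divided by step count, which is why it only holds while the hardware has spare compute. Once batching already saturates compute, as in hosted serving, the extra steps are pure cost.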

horsawlarway15d ago

I'm particularly curious to know how this plays out, and I seriously hope that more labs focus on diffusion models for text usage.

My immediate thought - this performs slightly worse than the autoregressive gemma equivalent, but it may also let me functionally run better models in diffusion variants.

Ex - I can run 70b-120b autoregressive models locally right now, but I get ~5-15t/s, which just isn't fast enough for serious work.

Which caps me down in the 20-36b models (ex - gemma4) where I can get 100+t/s on the same hardware.

So the question becomes - does the quality drop from a diffusion model outweigh the quality bump from using a larger model?

Because if not... sounds like diffusion models have a lot of space to thrive.

---

Sadly - if they can't be hosted profitably, I question whether this space will actually be explored.

simonw15d ago· 3 in thread

NVIDIA are hosting a free endpoint for this one, details at https://build.nvidia.com/google/diffusiongemma-26b-a4b-it - you have to create an account and (I think) verify a phone number too.

(I got it to draw a pelican: https://tools.simonwillison.net/markdown-svg-renderer#url=ht... )

dr_kiszonka15d ago

Maybe with very fast models you could request animation frames, e.g., frame 1) right foot at 12, left foot at 6; frame 2) right foot at 3, left foot at 9, etc.?

And instead of reporting tps, you would - of course! - report pfps (pelican frames per second).

alfirous15d ago

I registered a few weeks ago and the account is still not verified, despite following the procedure. Can't use the API if the account isn't verified.

ramses014d ago

Thought of you this afternoon, "after you click the record button can you make a 'boop, boop, boop, clack!' like a lead-in from a clapboard (using web audio synthesis apis)?"

...the result was quite surprising!

minimaxir15d ago· 3 in thread

A few days ago I was just thinking that Google never talked about their diffusion text generation model after demoing it at I/O a year ago. The rumor is that it was too expensive to run, but with the provided chart using the same 1x H100 hardware and comparing DiffusionGemma to regular Gemma, that shouldn't be the case. I'm curious what the downside for this speed is here aside from being slightly weaker than Gemma.

ac2915d ago

> I'm curious what the downside for this speed is here

"DiffusionGemma's speedup is designed for local and low-concurrency inference. In high-QPS cloud serving, autoregressive models can be deployed to saturate compute efficiently, so DiffusionGemma's parallel decoding offers diminishing returns and can result in higher serving costs"

GaggiX15d ago

Well with a standard autoregressive model you can generate for example 256 tokens at once if you have 256 users, with this approach you can generate 256 tokens for a single user but you need several forward steps.

So the diffusion process takes more GFLOPs, if you have enough users you can already balance memory and compute.

minimaxir15d ago

Batching is a fair counterpoint.

LarsDu8815d ago· 3 in thread

Does anyone know of the current intrinsic limitations with Diffusion text models compared to autoregressive?

I ran this question by ChatGPT and Claude and they came up with limitations in GRPO RLVR, but I'm not sure..

yorwba15d ago

The intrinsic limitation of text diffusion is that natural text contains serial dependencies where a word at the beginning of the text strongly influences what comes later, and if there is a long enough dependency chain within a diffusion block, the small number of diffusion steps may not be enough to resolve all dependencies, so that you end up with incoherent output.

LarsDu8815d ago

The obvious solution is to simply do more steps for larger sequences though, right?

How exactly does this work with CoT?

robkop15d ago

CoT legibility largely disappears which is quite concerning from a safety perspective

najarvg15d ago· 3 in thread

Do diffusion models support tool calls? If so is the tool call support on par with autoregressive models or worse? (edited spelling)

emilfihlman15d ago

Any text generation model can easily be made to support tool calls.

wsintra202214d ago

omlx.server - WARNING - POST /v1/chat/completions -> 400: Tool calling is not supported with diffusion models.

loopkid12d ago

Pull request #1837 that enables tool calls on supported diffusion models was merged as 7c1971e today. I previously tested mlx-community/diffusiongemma-26B-A4B-it-8bit on a custom patched version of omlx in the Zed Agent Panel. The majority of the tool calls worked.

What didn't work reliably was specifically write tool calls and this is not resolved by the pull request. But as far as I understand the problem is not the inference framework but the root issue is that DiffusionGemma emits incorrect JSON.

When `content` contains `, ` inside a string value, the decoder splits there and emits the remainder as a nonsensical JSON key. So `{"path": "f.py", "content": "def f(x, y):\n return x"}` becomes `{"path": "f.py", "content": "def f(x", "y):\n return x": ...}`.

I wondered if the JSON issue might be related to quantization and tested the BF16 variant of google/diffusiongemma-26b-a4b-it via NVIDIA NIM. The model did not show the delimiter-splitting bug. It did however have a quote-handling issue. Among others it duplicated triple quotes (`"""..."""` becomes `""""""...""""""`).
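Until the model side is fixed, a client could at least refuse to execute this class of malformed call. A minimal sketch, assuming the write tool takes exactly two string arguments named `path` and `content` (names taken from the example above); `validate_write_call` is a hypothetical helper, not part of omlx or Zed.

```python
import json

EXPECTED_KEYS = {"path", "content"}  # assumed schema, taken from the example


def validate_write_call(raw: str) -> dict:
    """Parse tool-call arguments and reject anything that does not match
    the expected schema exactly."""
    args = json.loads(raw)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")
    extra = set(args) - EXPECTED_KEYS
    missing = EXPECTED_KEYS - set(args)
    if extra or missing:
        raise ValueError(
            f"unexpected keys {sorted(extra)}, missing keys {sorted(missing)}")
    for key in EXPECTED_KEYS:
        if not isinstance(args[key], str):
            raise ValueError(f"{key!r} must be a string")
    return args


if __name__ == "__main__":
    good = r'{"path": "f.py", "content": "def f(x, y):\n return x"}'
    # The elided value from the report is replaced by "" here so the
    # string is parseable JSON; that substitution is an assumption.
    bad = r'{"path": "f.py", "content": "def f(x", "y):\n return x": ""}'
    print(validate_write_call(good)["path"])
    try:
        validate_write_call(bad)
    except ValueError as err:
        print("rejected:", err)
```

The split output is still syntactically valid JSON, so a plain parse would not catch it; the strict key check is what rejects it.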

bachmeier15d ago· 2 in thread

> DiffusionGemma reverses this inefficiency. Instead of predicting words sequentially, it drafts an entire 256-token paragraph simultaneously. By giving the computer's processor a larger chunk of work at once, DiffusionGemma utilizes your hardware to its full potential. It upgrades your model inference from a single, sequential typewriter to a massive printing press that stamps the entire block of text simultaneously.

> Operating as a 26B total Mixture of Experts (MoE) model that activates only 3.8B parameters during inference, DiffusionGemma fits comfortably within 18GB VRAM limits of high-end dedicated consumer GPUs when quantized.

Okay, so Gemma 4 26B is a MoE model that's really fast on my 24 GB GPU using ollama. This sounds like speculative decoding but I don't think that works with MoE models? It's hard to keep up with all this when it's not your job to keep up with it.

regularfry15d ago

This is a different model with, confusingly, approximately the same number of params as the existing gemma4 MoE. Unclear from a quick scan whether one was trained somehow from the other.

The mechanism isn't the same as speculative decoding. Speculative decoding happens sequentially and (usually) a couple of tokens at a time; diffusion doesn't, and does blocks of text at once. I haven't read the collateral yet but my assumption would be that it's trained to keep the specific experts stable across a diffusion block.

bachmeier15d ago

Thanks. I found this other comment that links to a very thorough explanation: https://news.ycombinator.com/item?id=48479042

SkitterKherpi15d ago· 2 in thread

It is cool but local models while okay already feel noticeably worse than even the cheapest APIs so I can't see myself sacrificing even a little bit of their quality for speed. I'm sure it's worth it for some usecases, curious to hear specific ones that people are already planning to deploy to production.

Mashimo15d ago

Maybe writing / bootstrapping unit tests?

Does not need opus level to write, and easy to iterate on.

SkitterKherpi15d ago

I can see it but even if I do that for something like tests I'd still eat the time cost of the normal Gemma for 10% extra performance. And further, if you switch between the fast and normal Gemma for different tasks you eat the big time cost of loading the other model (and maintaining both in the first place).

xnx15d ago· 2 in thread

Is the diffusion approach any use in Multi-Token Prediction (MTP) drafters? https://blog.google/innovation-and-ai/technology/developers-...

fcanesin15d ago

Yes, DFlash is currently a SOTA speculative decoding method that Xiaomi just used in their MiMo model for >1000tkps

doctorpangloss15d ago

MTP is a training optimization. Drafting requires verification, and verification is the full model inference. Speculative decoders are the name for the inference time optimization, that is more like a verifier that is a smaller model.

petercooper15d ago· 2 in thread

I'm not getting anywhere near the speeds advertised on my 3090 Ti, alas, but it's fun watching it "fill out" its answers. I did Simon's "SVG pelican on a bicycle" test on it and the result was quite minimalistic but fit the brief: https://gist.github.com/peterc/7672e74ec1437945e5fca5ce2c1c9... -- this was on the Q4 quant running on patched llama.cpp. I will be interested to see if Simon's looks much different.

osanseviero14d ago

Hi! What implementation are you using? Right now vLLM is the one recommended. llama.cpp is in an early draft.

petercooper14d ago

Yeah, the patched llama.cpp. The reason is I saw that using the Q4 quant on vLLM is discouraged and the int8 won't fit on my 3090 Ti, but I could certainly give it a go. I also skipped Transformers as it needs to download the full weights and quantize them locally and I didn't fancy waiting for a 50GB download.

schmorptron15d ago· 1 in thread

What would a diffusing reasoning model look like? have a pre-defined length [thinking] block that gets diffused over a long time, and then the final output block uses what is in that thinking block as part of its input? And how do diffusion models decide the output length in the first place, is it a pre-set parameter? or does it diffuse an [end] token into the middle somewhere?

schmorptron15d ago

got one answer by reading the rest of the comments, makes sense that the diffusion process is inherently reasoning-like: https://www.inceptionlabs.ai/blog/introducing-mercury-2

roosgit15d ago· 1 in thread

Can LoRAs be used to increase the quality of these diffusion models? Nvidia mentions something about this https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-8B#inf...

pilooch15d ago

Yes, full ft or lora https://github.com/NVIDIA-NeMo/Automodel/blob/main/docs/guid...

rvz15d ago· 1 in thread

We need more local open weight models that are performant and just as good (or good enough) as the best frontier ones.

Then you will be able to achieve Jevons Paradox and enjoy the same “productivity gains” without paying for these extortionate token prices by closed model providers or have it as cheap as possible.

And especially, no silent nerfing of the model.

_fw15d ago

We have this though, right? Compare SOTA local models to where the frontier was last year. There weren't many people complaining that last year's frontier models were incapable.

Next year, and the year after, Fable, GPT 5.5 and Gemini 3.5 will feel quite ordinary. And perhaps even within reach of a prosumer running models locally.

beklein15d ago

A good visual explanation of how text diffusion models like DiffusionGemma work: https://newsletter.maartengrootendorst.com/p/a-visual-guide-...

incognito12415d ago

I just *love* the commit message on Github: "Make TPUs go brr"

nullc15d ago

Has anyone evaluated any diffusion LLMs for error spotting?

E.g. run your normal autoregressive LLMs (with MTP whatever, as you like), then run a single diffusion pass over the result, and observe any tokens that diffusion thinks are unlikely.

Then prompt the autoregressive llm with some structured reasoning "<think>Is <diffusion unlikely part> an error? .."

Because the diffusion model is so structurally different perhaps it makes different errors such that this would provide gains even vs running distinct autoregressive LLMs which often make the same errors.

The same argument could apply for RWKV but it would be relatively expensive to apply it as a second pass on a big block of output, while it seems like a diffusion model would be cheaper.
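A rough sketch of the plumbing, assuming the second model can hand back one log-probability per token of the already-generated text. How you would get those numbers out of a diffusion model is not covered here; `flag_unlikely`, `review_prompt`, and the threshold are all hypothetical.

```python
from typing import List, Tuple


def flag_unlikely(tokens: List[str], logprobs: List[float],
                  threshold: float = -4.0) -> List[Tuple[int, str]]:
    """Return (index, token) pairs whose log-probability under the
    second model falls below the threshold."""
    if len(tokens) != len(logprobs):
        raise ValueError("tokens and logprobs must have the same length")
    return [(i, tok) for i, (tok, lp) in enumerate(zip(tokens, logprobs))
            if lp < threshold]


def review_prompt(flagged: List[Tuple[int, str]]) -> str:
    """Build the structured follow-up for the autoregressive model."""
    parts = [f"Is {tok!r} (token {i}) an error?" for i, tok in flagged]
    return "<think>" + " ".join(parts)


if __name__ == "__main__":
    tokens = ["x", "=", "y", "+", "z"]
    logprobs = [-0.1, -0.2, -6.0, -0.3, -0.5]  # made-up values
    print(review_prompt(flag_unlikely(tokens, logprobs)))
```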

anotherpaul15d ago

Maybe someone can explain: in image generation some models are already using rectified flow. Which was hailed as the next big thing. Are we going to see discrete rectified flow models next or is that unlikely?

jauntywundrkind15d ago

I'm curious how diffusion models do at tool calling, curious what wins there are there.

The video demo of the svg sword is an interesting example of what is so interesting about diffusion models: it's not just putting one token after another to make edits to a file. It's skipping around, it's re-editing previous lines. I feel like forcing it to write tool calls is maybe not its best nature.

I feel like perhaps instead of a monolithic edit file tool call, perhaps the diffusion model would be better suited to posting a change stream, a series of edit ops, across multiple files.

RandyOrion14d ago

Thanks gemma team for this release.

Compared to autoregressive decoding, diffusion is huge for local MoE inference because of the improved token generation efficiency, especially for normal GPU + ram offload setting.

However, there are models which are better positioned on the performance vs memory pareto front, i.e. dense models, so I'll just wait.

P.S. QAT is really something as it reduces the performance fluctuations compared to the normal one. Thanks again.

orthoxerox14d ago

It's nice that Unsloth has already published the model on HF, but it requires a fork of llama.cpp to run at the moment.

diimdeep14d ago

I wish labs would do QAT and release these quants. At this point, looking at releases of bf16 without QAT feels like looking at half-baked bread: we can quantize it, but it is not the same as QAT. Or am I missing something here?

zamalek15d ago

Is anyone doing text diffusion in latent space instead of tokens?

chc415d ago

Is it just me who thinks it's kinda weird that they conflate speed in tokens/second with latency, when I think of latency as time to first token? Like, it generates an entire paragraph of tokens faster, but wouldn't it still be slower if your reply is only 1 word, because it has to do the entire 256 tokens as a chunk?

bandrami15d ago

I always thought that fundamentally diffusors were the cooler idea of the two

hmate915d ago

I can’t help but feel like there’s something here that will matter for future LLMs.

The bidirectionality could be a big deal: being able to refine a sentence with both left and right context feels closer to how editing/thinking actually works than committing to each token forever.

Maybe the current models aren’t good enough yet, but the direction feels important.

j / k navigate · click thread line to collapse

88 comments

78 comments · 27 top-level

vineyardmike15d ago· 16 in thread

Anyways, so excited for an open-weight model and I hope it performs well. I’ll be testing this ASAP.

onlyrealcuzzo15d ago

bee_rider15d ago

How do you measure “true” complexity? Cyclomatic seems a bit… I dunno, artificial? Blunt? But it has the benefit of being defined.

1 more reply

fridder14d ago

I wonder if a dedicated client or mode in a client would provide some benefits. Might also be interesting to do adversarial stuff too where it argues with itself or another model

Daishiman15d ago

What metrics have you found useful?

yeodev15d ago

irthomasthomas15d ago

  llm consortium save cns-glm -m glm-5.2 -n 5 --arbiter mercury-2 --judging-method rank

  llm consortium save cns-kimi -m k2.6 -n 5 --arbiter mercury-2 --judging-method rank

  llm consortium save cns-meta-glm-kimi -m cns-glm -m cns-kimi --arbiter mercury-2 --judging-method synthesis

Now when I prompt cns-meta-glm-kimi it will pick the best of five from kimi and glm before creating a synthesis from the two winners.

SwellJoe15d ago

I've found the average output of many suboptimal models is still suboptimal, especially when it comes to judging the accuracy/correctness of the work of other models.

1 more reply

evilturnip15d ago

bpavuk15d ago

embedding-shape15d ago

> I like my boilerplate POJOs/data classes generated at breakneck pace of 300+ tok/s

Regardless of speed, use the LLM to eliminate the need for boilerplate rather than just creating more code faster.

> if it's too slow, you just stay in that goddamn async death loop

Slow is fast, longer-term anyways.

1 more reply

elxr15d ago

10-20 seconds times a couple turns on a new feature isn't bad. Kimi is also similarly fast if not faster.

I do agree with smaller models for more constrained/routine tasks though.

1 more reply

fittingopposite14d ago

Mercury is a US LLM from https://www.inceptionlabs.ai/

desireco4215d ago

Wow... I forgot about that. Mercury is brutal. I had it review lint errors and the speed is just insane.

skybrian15d ago

Could you say more about how you use it? What does your workflow look like?

vineyardmike15d ago

andai15d ago

So you're making smaller edits?

samuelknight15d ago· 4 in thread

Some of these comments miss the advantage of diffusion. This will have a big impact on edge devices, such as your phone or the GPU in your computer.

zozbot23415d ago

zozbot23414d ago

BarakWidawsky15d ago

You’re mostly right, but you’re conflating attention with autoregressive/causal decoding, which is the real issue that prevents you from using more compute.

You can use diffusion with attention, and this model does in fact use attention.

samuelknight15d ago

Yes, I should have said autoregressive.

SwellJoe15d ago· 4 in thread

halJordan14d ago

SwellJoe14d ago

If they're gimping it, why is nobody else making a better one that small?

zozbot23415d ago

The nice thing about DeepSeek is its ability to be run on local hardware, with no API costs involved. If you care deeply about that, then it being a bit worse than Opus or GPT isn't really a problem.

bArray15d ago

After the AI bubble bursts, it will be the likes of Google that come out the other side still wearing their shirts. I think this bubble is out to scalp some giants.

kkukshtel15d ago· 4 in thread

I think this is the future. The sort of left-field rumble that turns into a quake in 5 years.

famouswaffles15d ago

regularfry15d ago

It's fast enough that "ask it twice and pick the best" should still come out ahead performance-wise. I don't know how much that would close the quality gap by, but it's worth a play.

lambda15d ago

This may be the future of local models.

The thing is, diffusion models perform somewhat worse than autoregressive on text. So you lose some performance.

horsawlarway15d ago

I'm particularly curious to know how this plays out, and I seriously hope that more labs focus on diffusion models for text usage.

My immediate thought - this performs slightly worse than the autoregressive gemma equivalent, but it may also let me functionally run better models in diffusion variants.

Ex - I can run 70b-120b autoregressive models locally right now, but I get ~5-15t/s, which just isn't fast enough for serious work.

Which caps me down in the 20-36b models (ex - gemma4) where I can get 100+t/s on the same hardware.

So the question becomes - does the quality drop from a diffusion model outweigh the quality bump from using a larger model?

Because if not... sounds like diffusion models have a lot of space to thrive.

---

Sadly - if they can't be hosted profitably, I question whether this space will actually be explored.

1 more reply

simonw15d ago· 3 in thread

NVIDIA are hosting a free endpoint for this one, details at https://build.nvidia.com/google/diffusiongemma-26b-a4b-it - you have to create an account and (I think) verify a phone number too.

(I got it to draw a pelican: https://tools.simonwillison.net/markdown-svg-renderer#url=ht... )

dr_kiszonka15d ago

Maybe with very fast models you could request animation frames, e.g., frame 1) right foot at 12, left foot at 6; frame 2) right foot at 3, left foot at 9, etc.?

And instead of reporting tps, you would - of course! - report pfps (pelican frames per second).

alfirous15d ago

I registered a few weeks ago and the account is still not verified, despite following the procedure. I can't use the API if the account is not verified.

ramses014d ago

Thought of you this afternoon, "after you click the record button can you make a 'boop, boop, boop, clack!' like a lead-in from a clapboard (using web audio synthesis apis)?"

...the result was quite surprising!

minimaxir15d ago· 3 in thread

ac2915d ago

> I'm curious what the downside for this speed is here

GaggiX15d ago

So the diffusion process takes more GFLOPs, if you have enough users you can already balance memory and compute.

minimaxir15d ago

Batching is a fair counterpoint.

LarsDu8815d ago· 3 in thread

Does anyone know the current intrinsic limitations of diffusion text models compared to autoregressive ones?

I ran this question by ChatGPT and Claude and they came up with limitations in GRPO RLVR, but I'm not sure...

yorwba15d ago

LarsDu8815d ago

The obvious solution is to simply do more steps for larger sequences though, right?

How exactly does this work with CoT?

robkop15d ago

CoT legibility largely disappears which is quite concerning from a safety perspective

najarvg15d ago· 3 in thread

Do diffusion models support tool calls? If so is the tool call support on par with autoregressive models or worse? (edited spelling)

emilfihlman15d ago

Any text generation model can easily be made to support tool calls.

wsintra202214d ago

omlx.server - WARNING - POST /v1/chat/completions -> 400: Tool calling is not supported with diffusion models.

loopkid12d ago

bachmeier15d ago· 2 in thread

regularfry15d ago

This is a different model with, confusingly, approximately the same number of params as the existing gemma4 MoE. Unclear from a quick scan whether one was trained somehow from the other.

bachmeier15d ago

Thanks. I found this other comment that links to a very thorough explanation: https://news.ycombinator.com/item?id=48479042

2 more replies

SkitterKherpi15d ago· 2 in thread

Mashimo15d ago

Maybe writing / bootstrapping unit tests?

They don't need an Opus-level model to write, and they're easy to iterate on.

SkitterKherpi15d ago

xnx15d ago· 2 in thread

Is the diffusion approach any use in Multi-Token Prediction (MTP) drafters? https://blog.google/innovation-and-ai/technology/developers-...

fcanesin15d ago

Yes, DFlash is currently a SOTA speculative decoding method that Xiaomi just used in their MiMo model for >1000tkps

doctorpangloss15d ago

petercooper15d ago· 2 in thread

osanseviero14d ago

Hi! What implementation are you using? Right now vLLM is the recommended one. llama.cpp support is in an early draft.

petercooper14d ago

schmorptron15d ago· 1 in thread

schmorptron15d ago

got one answer by reading the rest of the comments, makes sense that the diffusion process is inherently reasoning-like: https://www.inceptionlabs.ai/blog/introducing-mercury-2

roosgit15d ago· 1 in thread

Can LoRAs be used to increase the quality of these diffusion models? Nvidia mentions something about this https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-8B#inf...

pilooch15d ago

Yes, full ft or lora https://github.com/NVIDIA-NeMo/Automodel/blob/main/docs/guid...

rvz15d ago· 1 in thread

We need more local open weight models that are performant and just as good (or good enough) as the best frontier ones.

And especially, no silent nerfing of the model.

_fw15d ago

We have this though, right? Compare SOTA local models to where the frontier was last year. There weren't many people complaining that last year's frontier models were incapable.

Next year, and the year after, Fable, GPT 5.5 and Gemini 3.5 will feel quite ordinary. And perhaps even within reach of a prosumer running models locally.

beklein15d ago

A good visual explanation of how text diffusion models like DiffusionGemma work: https://newsletter.maartengrootendorst.com/p/a-visual-guide-...

incognito12415d ago

I just *love* the commit message on Github: "Make TPUs go brr"

nullc15d ago

Has anyone evaluated any diffusion LLMs for error spotting?

E.g. run your normal autoregressive LLM (with MTP or whatever you like), then run a single diffusion pass over the result, and observe any tokens that diffusion thinks are unlikely.

Then prompt the autoregressive LLM with some structured reasoning: "<think>Is <diffusion unlikely part> an error? .."
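A minimal Python sketch of that second pass, assuming you can already get one probability per output token from a single pass of the diffusion model. How to obtain those probabilities is not shown; the function names, the 0.05 threshold, and the example numbers are all hypothetical.

```python
from typing import List, Tuple


def flag_unlikely(tokens: List[str], probs: List[float],
                  threshold: float = 0.05) -> List[Tuple[int, str]]:
    """Return (position, token) pairs whose assumed per-token probability is below threshold."""
    if len(tokens) != len(probs):
        raise ValueError("tokens and probs must have the same length")
    return [(i, tok) for i, (tok, p) in enumerate(zip(tokens, probs)) if p < threshold]


def review_prompt(flagged: List[Tuple[int, str]]) -> str:
    """Build the structured-reasoning prefix to hand back to the autoregressive model."""
    spans = ", ".join(f"'{tok}' (position {i})" for i, tok in flagged)
    return f"<think>Is {spans} an error? .."


# Made-up example: the diffusion pass finds the minus sign surprising.
tokens = ["total", "=", "a", "-", "b"]
probs = [0.90, 0.95, 0.80, 0.02, 0.70]

flagged = flag_unlikely(tokens, probs)
if flagged:
    print(review_prompt(flagged))
```

The threshold is the obvious tuning knob: too low and real errors slip through, too high and the autoregressive model gets asked to re-check everything.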

The same argument could apply for RWKV but it would be relatively expensive to apply it as a second pass on a big block of output, while it seems like a diffusion model would be cheaper.

anotherpaul15d ago

jauntywundrkind15d ago

I'm curious how diffusion models do at tool calling, curious what wins there are there.

I feel like perhaps instead of a monolithic edit file tool call, perhaps the diffusion model would be better suited to posting a change stream, a series of edit ops, across multiple files.

RandyOrion14d ago

Thanks gemma team for this release.

Compared to autoregressive decoding, diffusion is huge for local MoE inference because of the improved token generation efficiency, especially in a typical GPU + RAM offload setup.

However, there are models which are better positioned on the performance vs memory pareto front, i.e. dense models, so I'll just wait.

P.S. QAT is really something as it reduces the performance fluctuations compared to the normal one. Thanks again.

orthoxerox14d ago

It's nice that Unsloth has already published the model on HF, but it requires a fork of llama.cpp to run at the moment.

diimdeep14d ago
