Yes, it's incredibly boring to wait for the AI Agents in IDEs to finish their job. I get distracted and open YouTube. Once I gave a prompt so big and complex to Cline it spent 2 straight hours writing code.
But after these 2 hours I spent 16 more tweaking and fixing all the stuff that wasn't working. I now realize I should have done things incrementally even when I have a pretty good idea of the final picture.
I've been more and more only using the "thinking" models of o3 in ChatGPT, and Gemini / Claude in IDEs. They're slower, but usually get it right.
But at the same time I am open to the idea that speed can unlock new ways of using the tooling. It would still be awesome to basically just have a conversation with my IDE while I am manually testing the app. Or combine really fast models like this one with a "thinking background" one, that would run for seconds/minutes but try to catch the bugs left behind.
I guess only giving a try will tell.
Think of the old example where an autoregressive model would output: "There are 2 possibilities.." before it really enumerated them. Often the model has trouble overcoming the bias and will hallucinate a response to fit the preceding tokens.
Chain of thought and other approaches help overcome this and other issues by incentivizing validation, etc.
With diffusion, however, it is easier for the rest of the generated answer to change that set of tokens to match the actual number of possibilities enumerated.
This is why I think you'll see diffusion models be able to do some more advanced problem solving with a smaller number of "thinking" tokens.
This is true in principle for general diffusion models, but I don't think it's true for the noise model they use in Mercury (at least, going by a couple of academic papers authored by the Inception co-founders.) Their model generates noise by masking a token, and once it's masked, it stays masked. So the reverse-diffusion gets to decide on the contents of a masked token once, and after that it's fixed.
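To make the "once it's unmasked, it's fixed" point concrete, here is a toy sketch of reverse sampling under an absorbing-mask noise model. It is not Mercury's implementation: `toy_denoiser` is a hypothetical stand-in that proposes random tokens, and the even unmasking schedule is my own assumption.

```python
import random

MASK = "_"

def toy_denoiser(seq, rng):
    """Hypothetical stand-in for the network: proposes a token for every position."""
    vocab = ["a", "b", "c"]
    return [rng.choice(vocab) for _ in seq]

def reverse_masked_diffusion(length, steps, seed=0):
    """Start fully masked; each step commits some masked positions for good."""
    rng = random.Random(seed)
    seq = [MASK] * length
    history = [list(seq)]
    for step in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break
        proposal = toy_denoiser(seq, rng)
        # Assumed schedule: unmask an even share of what is still masked.
        remaining_steps = steps - step
        k = -(-len(masked) // remaining_steps)  # ceiling division
        for i in rng.sample(masked, k):
            seq[i] = proposal[i]  # committed here and never revisited
        history.append(list(seq))
    return seq, history

final, history = reverse_masked_diffusion(length=8, steps=4)
for snapshot in history:
    print("".join(snapshot))
```

Each printed row only ever fills in `_` positions; nothing already written is changed, which is why this kind of sampler cannot go back and fix "There are 2 possibilities" after the fact.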
We're long past that point of model complexity.
And more importantly it's a simple option+shift+1 away. I simply type something like "fix that" and it has all the context it needs to do its thing. Because it connects to my IDE and sees my open editor and the highlighted line of code that is bothering me. If I don't like the answer, I might escalate to o3 sometimes. Other models might be better but they don't have the same UX. Claude desktop is pretty terrible, for example. I'm sure the model is great. But if I have to spoon feed it everything it's going to annoy me.
What I'd love is for smaller faster models to be used by default and for them to escalate to slower more capable models only on an as-needed basis. Using something like o3 by default makes no sense. I don't want to have to think about which model is optimal for what question. The problem of figuring out what model is best to use is a much simpler one than answering my questions. And automating that decision opens the doors to having a multitude of specialized models.
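A minimal sketch of that escalation idea, under loud assumptions: the `Model` type, the self-reported confidence score, and the 0.8 threshold are all made up for illustration, and no real tool's API is implied.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Model:
    name: str
    # Hypothetical interface: returns (answer text, confidence in [0, 1]).
    answer: Callable[[str], Tuple[str, float]]

def route(prompt: str, models: List[Model], threshold: float = 0.8) -> Tuple[str, str]:
    """Try models from cheapest to most capable; stop at the first confident answer."""
    if not models:
        raise ValueError("need at least one model")
    name, text = "", ""
    for model in models:
        text, confidence = model.answer(prompt)
        name = model.name
        if confidence >= threshold:
            break  # good enough, no need to escalate
    return name, text

# Stub models: the fast one is only confident on short prompts.
fast = Model("fast", lambda p: ("quick guess", 0.9 if len(p) < 20 else 0.3))
slow = Model("slow", lambda p: ("careful answer", 0.95))

print(route("fix that", [fast, slow]))
print(route("refactor the whole payment module", [fast, slow]))
```

The hard part in practice is the confidence signal; a separate small classifier or a verifier pass could play that role instead of self-reporting.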
Are you, though?
There are obvious examples of obtaining speed without losing accuracy, like using a faster processor with bigger caches, or more processors.
Or optimizing something without changing semantics, or the safety profile.
Slow can be unreliable; a 10 gigabit ethernet can be more reliable than a 110 baud acoustically-coupled modem in mean time between accidental bit flips.
Here, the technique is different, so it is apples to oranges.
Could you tune the LLM paradigm so that it gets the same speed, and how accurate would it be?
Or just save yourself the time and money and code it yourself like it's 2020.
(Unless it's your employer paying for this waste, in which case go for it, I guess.)
Is this really what people are doing these days?
These models do not reason. They do not calculate. They perform no objectivity whatsoever.
Instead, these models show us what is most statistically familiar. The result is usually objectively sound, or at least close enough that we can rewrite it as something that is.
Btw, why call it "coder"? 4o-mini level of intelligence is for extracting structured data and basic summaries, definitely not for coding.
I agree, the comparison is dated, cherry-picked and doesn't reference the thinking models people do use for coding.
But it's also a bit of a new architecture in early stages of development/testing. Comparing against other small non-thinking models is a good step. It demonstrates the strategy is viable and worth exploring. Time will tell its value. Perhaps a guiding LLM could lean on diffusion to speed up generation. Perhaps we'll see more mixed-architecture models. Perhaps diffusion beats out current LLMs, but from my armchair this seems unlikely.
Saw another on Twitter past few days that looked like a better contender to Mercury, doesn't look like it got posted to LocalLLaMa, and I can't find it now. Very exciting stuff
https://www.reddit.com/media?url=https://i.redd.it/xci0dlo7h...
EDIT: This video in TFA was actually a much cooler demonstration - https://framerusercontent.com/assets/YURlGaqdh4MqvUPfSmGIcao...
To transform the string "AB" to "AC" using the given rules, follow these steps:
1. *Apply Rule 1*: Add "C" to the end of "AB" (since it ends in "B"). - Result: "ABC"
2. *Apply Rule 4*: Remove the substring "CC" from "ABC". - Result: "AC"
Thus, the series of transformations is:
- "AB" → "ABC" (Rule 1)
- "ABC" → "AC" (Rule 4)
This sequence successfully transforms "AB" to "AC".
¹ https://matthodges.com/posts/2025-04-21-openai-o4-mini-high-...
(Edited to remove direct spoiler for the MU-puzzle, in case people want to try it.)
The cost[1] is US$1.00 per million output tokens and US$0.25 per million input tokens. By comparison, Gemini 2.5 Flash Preview charges US$0.15 per million tokens for text input and $0.60 (non-thinking) output[2].
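For a rough feel of what those list prices mean, here is a small cost calculation. The prices are the ones quoted above; the workload of 2M input and 1M output tokens is an arbitrary assumption of mine.

```python
def cost_usd(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """Cost in US$ given per-million-token prices for input and output."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000

# Assumed workload: 2M input tokens, 1M output tokens.
mercury = cost_usd(2_000_000, 1_000_000, in_price_per_m=0.25, out_price_per_m=1.00)
gemini_flash = cost_usd(2_000_000, 1_000_000, in_price_per_m=0.15, out_price_per_m=0.60)

print(f"Mercury: ${mercury:.2f}, Gemini 2.5 Flash Preview: ${gemini_flash:.2f}")
print(f"ratio: {mercury / gemini_flash:.2f}x")
```

Since both the input and the output price are about 1.67x higher, the ratio comes out the same whatever the input/output mix is.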
Hmmm... at those prices they need to focus on markets where speed is especially important, eg high-frequency trading, transcription/translation services and hardware/IoT alerting!
1. https://files.littlebird.com.au/Screenshot-2025-05-01-at-9.3...
Chinese companies will be similarly eager for market share, but not everyone has the access to the same raw capital.
In practice, iiuc, HFT still happens within 10s of milliseconds, and I doubt even current dLLM is THAT fast.
That's part of the reason to compare against older, smaller models since they're at a more comparable stage of development.
You have 2 minutes to cool down a cup of coffee to the lowest temp you can
You have two options:
1. Add cold milk immediately, then let it sit for 2 mins.
2. Let it sit for 2 mins, then add the cold milk.
Which one cools the coffee to the lowest temperature and why?
And Mercury gets this right - while as of right now ChatGPT 4o gets it wrong.
So that’s pretty impressive.
To determine which option cools coffee the most, I'll analyze the heat transfer physics involved. The key insight is that the rate of heat loss depends on the temperature difference between the coffee and the surrounding air. When the coffee is hotter, it loses heat faster.
Option 1 (add milk first, then wait):
- Adding cold milk immediately lowers the coffee temperature right away
- The coffee then cools more slowly during the 2-minute wait because the temperature difference with the environment is smaller
Option 2 (wait first, then add milk):
- The hot coffee cools rapidly during the 2-minute wait due to the large temperature difference
- Then the cold milk is added, creating an additional temperature drop at the end
Option 2 will result in the lowest final temperature. This is because the hotter coffee in option 2 loses heat more efficiently during the waiting period (following Newton's Law of Cooling), and then gets the same cooling benefit from the milk addition at the end. The mathematical principle behind this is that the rate of cooling is proportional to the temperature difference, so keeping the coffee hotter during the waiting period maximizes heat loss to the environment.
On the assumption that the cold milk is always at a fixed temperature until it's mixed in, then the temperature of coffee at point of mixing is the main factor. Before and after, it follows Newton's law of cooling. So we're comparing something like Tenv + [(Tc+Tm)/2 - Tenv]e^(-2) vs (Tenv + [Tc - Tenv]e^(-2) + Tm)/2. The latter is greater than the former only when Tm > Tenv (the milk isn't cold), or in other words it's better to let the coffee cool as much as possible before mixing assuming the milk is colder than the environment.
Another interesting twist is to consider the case where the milk isn't kept at a fixed temperature but is also subject to warming (it's taken out of the fridge). Then the former equation is unchanged but the latter becomes (Tenv + [Tc - Tenv]e^(-2) + Tenv + [Tm - Tenv]e^(-2))/2. But this is equivalent to the former equation, so in this case it doesn't matter when you mix it.
Not 100% confident in either analysis, but I wonder if there's a more intuitive way to see it. I also don't know if deviating from the assumption of equivalent mass & specific heat changes the analysis (it might lead to a small range where for the fixed case, situation 1 is better?) It's definitely not "intuitive" to me.
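A quick numeric check of the two comparisons above, under the same assumptions (equal masses and specific heats, decay factor e^(-2) over the wait). The temperatures are made-up example values and the function names are mine.

```python
import math

K = math.exp(-2)  # decay factor over the whole wait, as in the formulas above

def cool(temp, t_env, k=K):
    """Newton's law of cooling: the gap to the environment shrinks by factor k."""
    return t_env + (temp - t_env) * k

def mix(a, b):
    """Equal masses and specific heats assumed, so mixing is a plain average."""
    return (a + b) / 2

def milk_first(t_c, t_m, t_env):
    return cool(mix(t_c, t_m), t_env)

def milk_last_fixed(t_c, t_m, t_env):
    # Milk stays at t_m until it is poured (e.g. it waits in the fridge).
    return mix(cool(t_c, t_env), t_m)

def milk_last_warming(t_c, t_m, t_env):
    # Milk sits out and drifts toward room temperature too.
    return mix(cool(t_c, t_env), cool(t_m, t_env))

t_c, t_m = 90.0, 5.0  # example coffee and milk temperatures in deg C
for t_env in (20.0, -10.0):
    print(t_env,
          round(milk_first(t_c, t_m, t_env), 2),
          round(milk_last_fixed(t_c, t_m, t_env), 2),
          round(milk_last_warming(t_c, t_m, t_env), 2))
```

Algebraically the difference between the fixed-milk late option and the early option is (Tm - Tenv)(1 - e^(-2))/2, so waiting wins exactly when the milk is colder than the room, and the sign flips in something like a -10 °C room. The warming-milk case matches the early option in both rows.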
The puzzle assumes that the room temperature is greater than the cold milk's temperature. When I added that the room temperature is, say, -10 °C, Mercury fails to see the difference.
It would only matter if the air were able to cool the coffee to a temperature less than that of the milk in under 2 minutes.
Okay, let's break this down using the principle of heat transfer (specifically Newton's Law of Cooling):
Heat loss is faster when the temperature difference is greater. A very hot object loses heat to cool surroundings much faster than a warm object.
Option 1: Add milk immediately.
You instantly lower the coffee's temperature by mixing it with cold milk.
Now, this warmer (not hot) mixture sits for 2 minutes. Because the temperature difference between the mixture and the room is smaller, it cools down more slowly over those 2 minutes.
Option 2: Let it sit for 2 mins, then add milk.
The very hot coffee sits for 2 minutes. Because the temperature difference between the hot coffee and the room is large, it loses heat more quickly during these 2 minutes.
After 2 minutes of rapid cooling, you then add the cold milk, lowering the temperature further.
Conclusion: To get the coffee to the lowest temperature, you should choose Option 2: Let it sit for 2 mins, then add the cold milk.
Unless there's a gotcha somewhere in your prompt that I'm missing, like what if the temperature of the room is hotter than the coffee, or so cold that the coffee becomes colder than the milk, or something?
I would be surprised if any models get it wrong, since I assume it shows up in training data a bunch?
ChatGPT:
Option 1 — Add the cold milk immediately — will result in a lower final temperature after 2 minutes.
Why:
• Heat loss depends on the temperature difference between the coffee and the environment (usually room temperature).
• If you add the milk early, the overall temperature of the coffee-milk mixture is reduced immediately. This lowers the average temperature over the 2 minutes, so less heat is lost to the air.
• If you wait 2 minutes to add the milk, the hotter coffee loses more heat to the environment during those 2 minutes, but when you finally add the milk, it doesn’t cool it as much because the coffee’s already cooler and the temp difference between the milk and the coffee is smaller.
Summary:
• Adding milk early = cooler overall drink after 2 minutes.
• Adding milk late = higher overall temp after 2 minutes, because more heat escapes during the time the coffee is hotter.
Want me to show a simple simulation or visualisation of this?
In math/science questions some things are assumed to be (practically impossibly) instant.
> Mercury gets this right - while as of right now ChatGPT 4o gets it wrong.
This is so common a puzzle it's discussed all over the internet. It's in the data used to build the models. What's so impressive about a machine that can spit out something easily found with a quick web search?
I was expecting this model to be nowhere near ChatGPT
Although someone above is saying 4o-mini got it right so maybe it’s meaningless. Or maybe thinking less helps…
There's already stuff in the wild moving that direction without completely rethinking how models work. Cursor and now other tools seem to have models for 'next edit' not just 'next word typed'. Agents can edit a thing and then edit again (in response to lints or whatever else); approaches based on tools and prompting like that can be iterated on without the level of resources needed to train a model. You could also imagine post-training a model specifically to be good at producing edit sequences, so it can actually 'hit backspace' or replace part of what it's written if it becomes clear it wasn't right, or if two parts of the output 'disagree' and need to be reconciled.
From a quick search it looks like https://arxiv.org/abs/2306.05426 in 2023 discussed backtracking LLMs and https://arxiv.org/html/2410.02749v3 / https://github.com/upiterbarg/lintseq trained models on synthetic edit sequences. There is probably more out there with some digging. (Not the same topic, but the search also turned up https://arxiv.org/html/2504.20196 from this Monday(!) about automatic prompt improvement for an internal code-editing tool at Google.)
Eh, it's mostly what we do. We don't re-type everything every time, but we do type top-to-bottom when we type. As you later mentioned, "next edit" models really strike that balance, and they're like 50% of the value I derive from a tool like Cursor.
I'd love to see more diff-outputs instead of "retyping" everything (with a nice UI for the humans). I suspect that part of the reason we have these "inhuman" actions is that the chat interface we've been using has led to certain outputs being more desirable due to the medium.
Something I don't see explored in their presentation is the ability of the model to recover from errors / correct itself. SotA LLMs shine at this, a few rounds of back and forth w/ sonnet / gemini pro / etc really solve most problems nowadays.
I'm curious what level of detail they're comfortable publishing around this, or are they going full secret mode?
But all but the first page seems to be missing in this PDF? There is just an abstract and (partial) outline.
>Instead of generating tokens one at a time, a dLLM produces the full answer at once. The initial answer is iteratively refined through a diffusion process, where a transformer suggests improvements for the entire answer at once at every step. In contrast to autoregressive transformers, the later tokens don’t causally depend on the earlier ones (leaving aside the requirement that the text should look coherent). For an intuition of why this matters, suppose that a transformer model has 50 layers and generates a 500-token reasoning trace, the final token of this trace being the answer to the question. Since information can only move vertically and diagonally inside this transformer and there are fewer layers than tokens, any computations made before the 450th token must be summarized in text to be able to influence the final answer at the last token. Unless the model can perform effective steganography, it had better output tokens that are genuinely relevant for producing the final answer if it wants the performed reasoning to improve the answer quality. For a dLLM generating the same 500-token output, the earlier tokens have no such causal role, since the final answer isn’t autoregressively conditioned on the earlier tokens. Thus, I’d expect it to be easier for a dLLM to fill those tokens with post-hoc rationalizations.
>Despite this, I don’t expect dLLMs to be a similarly negative development as Huginn or COCONUT would be. The reason is that in dLLMs, there’s another kind of causal dependence that could prove to be useful for interpreting those models: the later refinements of the output causally depend on the earlier ones. Since dLLMs produce human-readable text at every diffusion iteration, the chains of uninterpretable serial reasoning aren’t that deep. I’m worried about the text looking like gibberish at early iterations and the reasons behind the iterative changes the diffusion module makes to this text being hard to explain, but the intermediate outputs nevertheless have the form of human-readable text, which is more interpretable than long series of complex matrix multiplications.
Based solely on the above, my armchair analysis is that it seems like it's not strictly diffusion in the Langevin diffusion/denoising sense (since there are discrete iteration rounds), but instead borrows the idea of "iterative refinement". You drop the causal masking and token-by-token autoregressive generation, and instead start with a bunch of text and propose a series of edits at each step? On one hand dropping the causal masking over token sequence means that you don't have an objective that forces the LLM to learn a representation sufficient to "predict" things as normally thought, but on the flipside there is now a sort of causal masking over _time_, since each iteration depends on the previous. It's a neat tradeoff.
Subthread https://news.ycombinator.com/item?id=43851429 also has some discussion
It feels like models are becoming fungible apart from the hyperscaler frontier models from OpenAI, Google, Anthropic, et al.
I suppose VCs won't be funding many more "labs"-type companies or "we have a model" as the core value prop companies? Unless it has a tight application loop or is truly unique?
Disregarding the team composition, research background, and specific problem domain - if you were starting an AI company today, what part of the stack would you focus on? Foundation models, AI/ML infra, tooling, application layer, ...?
Where does the value accrue? What are the most important problems to work on?
With the speed this can generate its solutions, you could have it loop through attempting the solution, feeding itself the output (including any errors found), and going again until it builds the "correct" solution.
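A rough sketch of what such a loop could look like. Everything here is hypothetical: `generate` stands in for whatever model call you have, `check` for your test runner, and the stub model below just pretends to fix its bug once it sees an error.

```python
from typing import Callable, Optional, Tuple

def solve_with_feedback(
    task: str,
    generate: Callable[[str], str],         # hypothetical model call: prompt -> code
    check: Callable[[str], Optional[str]],  # None on success, else an error message
    max_attempts: int = 5,
) -> Tuple[Optional[str], int]:
    """Generate, test, and retry with the error appended until it passes or we give up."""
    prompt = task
    for attempt in range(1, max_attempts + 1):
        candidate = generate(prompt)
        error = check(candidate)
        if error is None:
            return candidate, attempt
        # Feed the failure back so the next attempt can react to it.
        prompt = f"{task}\n\nPrevious attempt:\n{candidate}\n\nError:\n{error}"
    return None, max_attempts

def fake_generate(prompt: str) -> str:
    # Stand-in for a model: only gets it right after seeing an error.
    if "Error:" in prompt:
        return "def add(a, b): return a + b"
    return "def add(a, b): return a - b"

def check_add(code: str) -> Optional[str]:
    scope: dict = {}
    try:
        exec(code, scope)  # only acceptable for trusted toy code like this
        assert scope["add"](2, 3) == 5, "add(2, 3) should be 5"
    except Exception as exc:
        return repr(exc)
    return None

code, attempts = solve_with_feedback("Write add(a, b).", fake_generate, check_add)
print(attempts, code)
```

The `max_attempts` cap matters: passing the tests is only as good as the tests are, so "correct" stays in quotes here too.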
About 10,000 lines of code, and I only intervened a few times, to revert a few commits and once to cut a big file into smaller ones so we could tackle the problems one by one.
I did not expect LLMs to be able to do this so soon. But I just commented to say about aider - the iteration loop really was mostly me pressing return. Especially in the navigator mode PR, as it automatically looked up the correct files to attach to the context
Speed is great but it doesn't seem like other text-based model trends are going to work out of the box, like reasoning. So you have to get dLLMs up to the quality of a regular autoregressive LLM and then you need to innovate more to catch up to reasoning models, just to match the current state of the art. It's possible they'll get there, but I'm not optimistic.
I wonder if the same would be true for a multi-modal diffusion model that can now also speak?
There is also this GitHub project that I played with a while ago that's trying to do this. https://github.com/GAIR-NLP/anole
Are there any OSS models that follow this approach today? Or are we waiting for somebody to hack that together?
This means on custom chips (Cerebras, Graphcore, etc...) we might see 10k-100k tokens/sec? Amazing stuff!
Also of note, funny how text generation started w/ autoregression/tokens and diffusion seems to perform better, while image generation went the opposite way.
They're running Qwen on a traditional LLM pipeline. The "diffusion effect", as it says there, is just decorative, lmao. That in itself shouldn't break the deal as I understand you have to put on a show, but, looking at the latency and timing of their outputs this is not a diffusion model, as they claim. They're also not even close to the 1,000 TPS figure they put out.
I'm surprised nobody on this forum got the slightest clue on that. I guess I should 4x my fee again :).
However,
> Prompt: Write a sentence with ten words which has exactly as many r’s in the first five words as in the last five
>
> Response: Rapidly running, rats rush, racing, racing.
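The constraint in that prompt is easy to check mechanically. A small checker, assuming words are whitespace-separated and that "r" is counted case-insensitively (the function name and the second example sentence are mine):

```python
def check_r_balance(sentence: str) -> bool:
    """True if the sentence has ten words and equal 'r' counts in each half."""
    words = sentence.split()
    if len(words) != 10:
        return False

    def count_r(chunk):
        return sum(word.lower().count("r") for word in chunk)

    return count_r(words[:5]) == count_r(words[5:])

print(check_r_balance("Rapidly running, rats rush, racing, racing."))
print(check_r_balance("The red car is fast and the blue river flows"))
```

The quoted response fails at the first hurdle because it has six words, not ten.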
o4 mini
https://chatgpt.com/share/681315c2-aa90-800d-b02d-c3ba653281...
That said, token-based models are currently fast enough for most real-time chat applications, so I wonder what other use-cases there will be where speed is greatly prioritized over smarts. Perhaps trading on Trump tweets?
[1] https://framerusercontent.com/assets/cWawWRJn8gJqqCGDsGb2gN0...
Diffusion is an alternative but I am having a hard time understanding the whole "built in error correction" that sounds like marketing BS. Both approaches replicate probability distributions which will be naturally error-prone because of variance.
"Four X"
and
"Four X and seven years ago".
In the first case X could be pretty much anything, but in the second case we both know the only likely completion.
So it seems like there would be a huge advantage in not having to run autoregressively. But in practice it's less significant than you might imagine because the AR model can internally model the probability of X conditioned on the stuff it hasn't output yet, and in fact because without reinforcement the training causes it to converge on the target probability of the whole output, the AR model must do some form of lookahead internally.
(That said RLHF seems to break this product of the probabilities property pretty badly, so maybe it will be the case that diffusion will suffer less intelligence loss ::shrugs::).
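For reference, the "product of the probabilities" property is the chain-rule factorization an autoregressive model is trained to match. The notation is mine: x_{<t} means the tokens before position t and x_{>t} the ones after it.

```latex
p(x_1,\dots,x_n) = \prod_{t=1}^{n} p(x_t \mid x_{<t}),
\qquad
p(x_t \mid x_{<t}) = \sum_{x_{>t}} p(x_t, x_{>t} \mid x_{<t})
```

The second identity is the lookahead point: each per-token conditional is already a sum over every possible continuation, so a model that fits the joint well has to account for what comes later even though it emits tokens left to right.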
You two may, but I don't. 'Decades'? 'Months'? 'Wives'? 'Jobs'? 'Conservative PMs'?
Groq is heading to a dead end.
If speed is your most important metric, I could still see there being a niche for this.
From a pure VC perspective though, I wonder if they'd be better off Open Sourcing their model to get faster innovation + centralization like Llama has done. (Or Mistral with keeping some models private, some public.)
Use it as marketing, get your name out there, and have people use your API when they realize they don't want to deal with scaling AI compute themselves lol
They're comparing against the fastest models. That's why smaller models are shown.
Put another way, how much would company x be willing to spend on "here's a repo, here are the tests, here is the speed now, make this faster while still passing all the tests". If it "solves" something in cudnn that makes it 10% faster, how much would nvidia pay for this? 1m$? 10m$?
High tech US service industry exports are cooked.
If I remember correctly hyperscalers put their green agendas in stasis now that LLMs are around and that makes me believe that there is a CO2 cost associated.
Still, any improvement is good news and if diffusion models replace autoregressive models we can invest that surplus in energy in something else useful for the environment.
I reckon it might incidentally happen if optimising for cost of power, depending on how correlated that is to the carbon intensity of power generation, which admittedly I haven't thought through.
[0] https://epoch.ai/gradient-updates/how-much-energy-does-chatg...