i.e. intelligence per token, and then tokens per second.
My current feel is that if Sonnet 4.6 were 5x faster than Opus 4.6, I'd primarily be using Sonnet 4.6. That wasn't true for me with prior model generations; back then, the Sonnet-class models didn't feel good enough compared to the Opus-class models. And it might shift again when I'm doing things that feel more intelligence-bottlenecked.
But fast responses have an advantage of their own: faster iteration. Kind of like how I used to like OpenAI Deep Research, but switched to o3-thinking with web search enabled once that came out, because it was 80% of the thoroughness in 20% of the time, which tended to be better overall.
Also, I put together a little research paper recently--I think there's probably an underexplored option of "Use frontier AR model for a little bit of planning then switch to diffusion for generating the rest." You can get really good improvements with diffusion models! https://estsauver.com/think-first-diffuse-fast.pdf
Cerebras requires a $3K/year membership to use APIs.
Groq's been dead for about 6 months, even pre-acquisition.
I hope Inception is going well; it's the only real democratic target at this. Gemini 2.5 Flash Lite was promising, but it never really went anywhere, even by the standards of a Google preview.
(I can also see a world where it just doesn't make sense to share most of the layers/infra and you diverge, but curious how you all see the approach.)
But to be clear, 1000 tokens/second is WAY better. Anthropic's Haiku serves at ~50 tokens per second.
At the moment I'm loving Opus 4.6, but I have no idea if its extra intelligence makes it worth using over Sonnet. Some data would be great!
Imagine the quality-of-life upgrade of getting compaction down to a few-second blip, or the "Explore" phase going 20 times faster! As these models get better, it will be super exciting!
Maybe we could use some sort of entropy-based metric as a proxy for that?
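To make that idea concrete, here's a minimal sketch of one such proxy: the mean Shannon entropy of the model's next-token distributions, computable from the per-token log-probabilities that many APIs expose. This is an illustration of the idea only; it assumes you can get a (truncated) logprob distribution per emitted token, and truncation will bias the estimate.

```python
import math

def token_entropy(logprobs):
    """Shannon entropy (in nats) of one next-token distribution,
    given log-probabilities that sum (in probability space) to ~1."""
    return -sum(math.exp(lp) * lp for lp in logprobs)

def mean_entropy(per_token_logprobs):
    """Average entropy across a generation: a rough proxy for how
    uncertain the model was at each emitted token."""
    ents = [token_entropy(lps) for lps in per_token_logprobs]
    return sum(ents) / len(ents)

# Toy example: a near-certain token vs. a 4-way coin flip.
confident = [math.log(0.97)] + [math.log(0.01)] * 3
uncertain = [math.log(0.25)] * 4
print(mean_entropy([confident]) < mean_entropy([uncertain]))  # True
```

A uniform 4-way distribution gives entropy ln(4) ≈ 1.39 nats, while the near-certain one is ≈ 0.17, so "easy" boilerplate tokens and genuinely hard tokens separate cleanly on this metric.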
picard_facepalm.jpg
There are also more advanced approaches, for example FlexMDM, which essentially predicts length of the "canvas" as it "paints tokens" on it.
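For readers unfamiliar with how these decode loops work, here's a toy sketch of the simpler fixed-length variant (MaskGIT-style parallel unmasking; FlexMDM's variable-length "canvas" is more involved). Everything here is illustrative: `fake_predict` is a random stand-in for a real denoiser.

```python
import random

MASK = "<mask>"

def fake_predict(canvas):
    """Stand-in for the denoiser: proposes a token and a confidence
    score for every still-masked position. (A real model goes here.)"""
    return {i: (f"tok{i}", random.random())
            for i, t in enumerate(canvas) if t == MASK}

def diffusion_decode(length=8, steps=4):
    """Start from an all-mask canvas; each step commits the most
    confident predictions and re-predicts the rest with more context."""
    canvas = [MASK] * length
    per_step = length // steps
    for _ in range(steps):
        proposals = fake_predict(canvas)
        ranked = sorted(proposals.items(), key=lambda kv: -kv[1][1])
        for i, (tok, _conf) in ranked[:per_step]:
            canvas[i] = tok
    return canvas

print(diffusion_decode())  # all 8 positions filled after 4 steps
```

The appeal is visible even in the toy: the loop runs a fixed, small number of model calls regardless of output length, instead of one call per token.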
what do TTFT numbers look like for Mercury 2? I can see how, at least compared to other reasoning models, it could improve things quite a bit, but I'm wondering if it really makes reasoning viable in voice, given that total latency still seems to be in single-digit seconds, not hundreds of milliseconds
At eboo.ai, we see this constantly—even with faster models, the orchestrator needs to be incredibly tight to keep the total loop under 500-800ms. If Mercury 2 can consistently hit low enough TTFT to keep the turn-taking natural, that would be a game changer for "smart" voice agents.
Right now, most "reasoning" in voice happens asynchronously or with very awkward filler audio. Lowering that floor is the real challenge.
https://gist.github.com/nlothian/cf9725e6ebc99219f480e0b72b3...
What causes this?
Is its agentic accuracy good enough to operate, say, coding agents without needing a larger model for the more difficult tasks?
We’re not positioning it as competing with the largest models (Opus 4.5, etc.) on hardest-case reasoning. It’s more of a “fast agent” model (like Composer in Cursor, or Haiku 4.5 in some IDEs): strong on common coding and tool-use tasks, and providing very quick iteration loops.
I can see some very specific use cases for an existing PKM project, especially using the edit model for tagging and potentially retrieval, both of which I'm still using Gemini 2.5 Flash-Lite for.
The pricing makes this very enticing, and I'll really try to get Mercury 2 going. If tool calling and structured output are truly as consistently possible with this model as with Haiku 4.5 (which I still rate very highly), that may make a few use cases far more feasible for me, as long as task adherence, task inference, and task evaluation aren't significantly worse than Haiku 4.5's. Gemini 3 Flash was less ideal for me: while it is significantly better than 3 Pro, there are still issues with CLI usage that make it unreliable for me.
Regardless of that, I'd like to provide some constructive feedback:
1.) Unless I am mistaken, I couldn't find a public status page. Doing some very simple testing via the chat website, I got an error a few times and wanted to confirm whether it was a known server-load issue or not, but couldn't.
2.) Your homepage looks very nice, but parts of it struggle on both Firefox and Chromium, with performance poor enough to affect usability. The highlighting of the three recommended queries on the homepage lags heavily, as does the header bar; the switcher between Private and Commercial on the Early Access page moves at a very sluggish pace, and the band showcasing your partners lags as well. When I removed the very nice-looking diffusion animation in the background, memory and CPU usage returned to normal levels and all the described issues were resolved, so perhaps that could be optimized further. It makes navigating the website rather frustrating, and first impressions are important, especially since the models are also supposed to be used for coding.
3.) I can understand if that is not possible, but it would be great if the reasoning traces were visible on the chat homepage. Will check later whether they are available on the API.
4.) Unless I am mistaken, I can't see the maximum output tokens anywhere on the website or documentation. Would be helpful if that were front and center. Is it still at roughly 15k?
5.) Consider changing how web search works on the chat website. Currently it is enabled by default, but the model only seems to use it when explicitly prompted to (and even then not in every case). I can understand why web search is used sparingly: the swift experience is what you want front and center, and every web search adds latency. But may I suggest disabling web search by default, and setting the model up so that when web search is enabled, that resource is relied upon more consistently?
6.) "Try suggested prompt" returns an empty field if a user goes from an existing chat back to the main chat page. After a reload, the suggested-prompt area contains those prompts again.
One thing that I very much like, and that has gotten my mind racing for PKM tasks, is the follow-up questions, which are provided essentially instantly. I can see some great value there, even combining that with another model's output to help a user explore concepts they may not be familiar with, but I will have to test, especially on the context/haystack front.
There has been quite some progress unifying DDPMs and SGMs under the SDE framework:
> DDPM and Score-Based Models: The objective function of DDPMs (maximizing the ELBO) is equivalent to the score matching objectives used to train SGMs.
> SDE-based Formulation: Both DDPMs and SGMs can be unified under a single SDE framework, where the forward diffusion is an Itô SDE and the reverse process uses score functions to recover data.
> Flow Matching (Continuous-Time): Flow matching is equivalent to diffusion models when the source distribution corresponds to a Gaussian. Flow matching offers "straight" trajectories compared to the often curved paths of diffusion, but they share similar training objectives and weightings.
Is there a similar connection between modern transformers and diffusion?
Suppose we look at each layer (or the residual connection between layers) acting on the context window of tokens (typically a power of 2). What is incrementally added to the embedding vectors is a function of the previous layer's outputs. If we have L layers, what is the connection between those L "steps" of a transformer and performing L denoising refinements of a similarly performing diffusion model?
Does this allow fitting a diffusion model to a transformer and vice versa?
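One way to make the analogy concrete (a sketch, not an established equivalence): the residual-stream update per layer looks like one Euler step of an ODE in "depth time" (the view behind neural ODEs), while a diffusion sampler takes Euler–Maruyama steps of the reverse SDE. The symbols below ($f_l$, $f$, $g$, $s_\theta$) are generic placeholders, not anything from a specific model.

```latex
% Transformer layer l as one Euler step in "depth time":
x_{l+1} = x_l + f_l(x_l)
\quad\longleftrightarrow\quad
x(t+\Delta t) \approx x(t) + \Delta t \, f\bigl(x(t), t\bigr)

% One reverse-diffusion (Euler--Maruyama) step with learned score s_\theta:
x_{t-\Delta t} = x_t
  + \bigl[\, g(t)^2 \, s_\theta(x_t, t) - f(x_t, t) \,\bigr]\Delta t
  + g(t)\sqrt{\Delta t}\; z,
\qquad z \sim \mathcal{N}(0, I)
```

The catch: a transformer's L steps use a different learned function $f_l$ at each depth, whereas a diffusion sampler reuses one network across steps, conditioned on $t$, and injects noise. So depth-as-time is suggestive, but fitting one to the other isn't a mechanical translation.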
Other labs like Google have them, but they have simply trailed the Pareto frontier for the vast majority of use cases.
Here's more detail on how price/performance stacks up
On speed/quality, diffusion has actually moved the frontier. At comparable quality levels, Mercury is >5× faster than similar AR models (including the ones referenced on the AA page). So for a fixed quality target, you can get meaningfully higher throughput.
That said, I agree diffusion models today don’t yet match the very largest AR systems (Opus, Gemini Pro, etc.) on absolute intelligence. That’s not surprising: we’re starting from smaller models and gradually scaling up. The roadmap is to scale intelligence while preserving the large inference-time advantage.
It looks like they are offering this in the form of "Mercury Edit" and I'm keen to try it.
Me: What are some of Maradona's most notable achievements in football?
Mercury 2 (first sentence only): Dieadona’s most notable football achievements include:
Notice the spelling "Dieadona" instead of "Maradona". Even a local 3B model can answer this question perfectly fine, and instantly... Mercury 2 was so incredibly slow, and full of these kinds of unforgivable mistakes.
It doesn't change the fact that the most important thing is verification/validation of the output, whether by tools or by a developer reviewing and making decisions. But even if you don't want that approach, diffusion models just seem a lot more efficient. I'm interested to see whether they are simply a better match for common developer tasks when paired with validation/verification systems, not just for writing (likely wrong) code faster.
And in some sense, all of my Claude Code usage feels tok/s-bottlenecked. There's never really a time when I'm glad to wait for the tokens; I'd always prefer faster.
Hell, want to do syntax highlighting? Just throw buffer text into an ultra-fast LLM.
It's easy to overlook how many small day-to-day heuristic schemes can be replaced with AI. It's almost embarrassing to think about all the totally mundane uses to which we can put fast, modest intelligence.
Assuming that's what is causing this. They might show some kind of feedback when it actually makes it out of the queue.
The open question for me is whether the quality ceiling is high enough for cases where the bottleneck is actually reasoning, not iteration speed. volodia's framing of it as a "fast agent" model (comparable tier to Haiku 4.5) is honest -- for the tasks that fit that tier, the 5x speed advantage is genuinely interesting.
And a pop-up error of: "The string did not match the expected pattern."
That happened three times, then the interface stopped working.
I was hoping to see how this stacked up against the Taalas demo, which has worked well and been blazing fast every time I've hit it this past week.
Mercury 2
Thought for a minute
Well?
Mercury 2 Today at 3:06 PM
Yes – Unicode includes a seahorse emoji. It is U+1F9A0 and renders as:
(seahorse)
(If your device or browser doesn’t show the graphic, you may see a placeholder box.)
> The snail you’re likely thinking of has a different code point:
> The correct code point for the snail emoji is U+1F40C? No—U+1F40C is SEAHORSE. The snail’s true code point is U+1F40C? Wait, this is still confusing.
>
> Let’s set the record straight:
>
> Snail → U+1F40C? No, that’s wrong.
> Snail → U+1F40C? Still wrong.
>
> The snail emoji is actually U+1F40C? I’m mixing things up.
>
> Correct snail code point
>
> The snail emoji is U+1F40C? No…
>
> After double‑checking the official Unicode tables (Unicode 13.0 and later), the snail is U+1F40C? The…
> Given the confusion, the safest answer is: the Unicode Standard defines U+1F40C as “SEAHORSE”. If your device shows a snail, it’s a rendering quirk, not a change in the underlying code point.

Though with the speed it’s running, that’s not necessarily a deal breaker. I suspect diffusion models will need different harnesses to be effective.
> We optimize for speed users actually feel: responsiveness in the moments users experience — p95 latency under high concurrency, consistent turn-to-turn behavior, and stable throughput when systems get busy.
> no reasoning comparison
Benchmarks against reasoning models:
https://www.inceptionlabs.ai/blog/introducing-mercury-2
> no demo
https://chat.inceptionlabs.ai/
> no info on numbers of parameters for the model
This is a closed model. Do other providers publish the number of parameters for their models?
> testimonials that don't actually read like something used in production
Fair point.
Mercury v1 focused on autocomplete and next-edit prediction. Mercury 2 extends that into reasoning and agent-style workflows, and we have editor integrations available (docs linked from the blog). I’d encourage folks to try the models!
> The car wash is 100m away from my house. Should I walk or drive?
I didn't find a way to share the conversation, so here is the full response on the `medium` setting:
---
Walking 100 m is a very short distance—roughly a minute or two at a normal pace. In most situations, walking will be quicker overall because you avoid the time needed to get into the car, start it, and find parking. It also saves fuel and is better for the environment.
*Consider these factors before deciding:*
| Factor | Walking | Driving |
|--------|---------|---------|
| *Time* | ~1–2 min (plus a few seconds to step out) | ~1 min driving, plus ~1–2 min to get in/out and park |
| *Energy/Fuel* | None | Uses a small amount of fuel |
| *Weather* | Comfortable if dry & mild | Better if it’s raining, very hot, or icy |
| *Physical ability* | Easy for most people | Not needed for such a short trip |
| *Car availability* | Not required | Must have a car nearby and be willing to use it |
| *Environmental impact* | Zero emissions | Small emissions |
*Recommendation:* If the weather is pleasant and you don’t need the car for anything else right away, walking is the simplest, fastest, and most eco‑friendly choice. Drive only if you’re dealing with inclement weather, have heavy items to carry, or need the car immediately after the wash.
Do you have any specific constraints (e.g., rain, heavy bags, time pressure) that might affect the decision?
Is the goal behind evaluating models this way to incentivize training them to assume we're bad-faith tricksters even when asking benign questions like how best to traverse a particular 100m? I can't imagine why it would be desirable to optimize for that outcome.
(I'm not saying that's your goal personally - I mean the goal behind the test itself, which I'd heard of before this thread. Seems like a bad test.)
i think instead of positioning as a general-purpose reasoning model, they'd have more success focusing on a specific use case (eg coding agent) and benchmarking against the sota open models for that use case (eg qwen3-coder-next)
Um, no it isn’t. Presumably this is the answer to any question about a company it doesn’t know? That’s some hardcore bias baking.
The future looks crazy
The real question is whether it ends up "smart enough", or whether we take that extra compute budget and push the boundary further. Right now it seems making the models larger really only works up to a certain point.
That's a very scary inflection point. Imagine, in 24 months, an Opus 4.6-level diffusion-based model etched directly onto silicon using the latest TSMC process node.
At that point, knowledge work will be incredibly commoditized.
I have Opus 4.6 one-shotting recreations of '90s video games for less than the inflation-adjusted cost of buying those original games when they were released! Now cut that cost by 250x!