MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second

(mimo.xiaomi.com)

628 points · gainsurier · 20d ago · 489 comments

233 comments · 55 top-level

goyozi20d ago· 40 in thread

Fast AI seems genuinely exciting and somewhat unsettling to me. Right now Claude is faster than me on some tasks but we’re at least close. I have a prompt to clean up a PR that’s been running for 1h now and I expect it to take another few. It’s hard to imagine what the workflow would look like if it was near-instant. On the one hand, it might be easier to focus. Some prompts take so long that I start to multitask and regret it later. On the other, AI that takes a few seconds to at most a few minutes to solve what used to take hours or days? That’s a game changer and I don’t even know where we fit in.

flexagoon20d ago

I'm using Deepseek-v4-pro as my main model and this is sometimes pretty annoying, I have to do some easy boring task, think "I'll just leave the agent to do it and go take a nap", but it's already done writing the code before I even walk away from the computer

throwaway6767819d ago

Agent mania setting in

It's also pretty funny sometimes how it gives weird future roadmap estimates ("part 2 - 3 weeks, part 3 - 2 months", etc.) and when you tell it to actually do those changes it's pretty much done in half an hour

SwellJoe19d ago

DeepSeek is the fastest model in the benchmarks I've been doing (https://swelljoe.com/post/will-it-mythos/). Followed not so closely by Opus 4.8 and even less closely by Gemini 3.5 Flash and GPT 5.5. I've been really impressed with it, so far. It's also among the best at doing the work, though still trailing the frontier models from Anthropic and OpenAI.

RussianCow19d ago

Do you mean Flash and not Pro? I haven't tried it personally, but according to OpenRouter, the fastest DeepSeek V4 Pro providers are only ~50tps. That's slower than Claude Opus.

https://openrouter.ai/deepseek/deepseek-v4-pro?sort=throughp...

binary001019d ago

I exclusively use deepseek v4 flash now, completely stopped using slow models like Claude.

Basically I never have to wait - yes I have to tell it little corrections occasionally (but I know the domain really well so that's not an issue), but it's so much faster than anything else it's kinda crazy. I love the super fast speeds with high involvement development cycle.

I actually enjoy using agentic development flows for the first time now - whereas with Claude I absolutely hated it. That 5 to 20 min wait after every prompt absolutely killed my desire to even want to work at all.

znpy19d ago

> I have to do some easy boring task, think "I'll just leave the agent to do it and go take a nap", but it's already done writing the code before I even walk away from the computer

the way software engineering works these days reminds me a lot of factory workers on production lines that just sit in front of a production line all day and take out faulty items and/or perform a single step in the production of goods.

throw-the-towel19d ago

FWIW, for me just today it got itself into silly rabbit holes twice, and both times I had to fix things myself. Scarily, this is something I catch myself doing as well.

abustamam19d ago

Take the nap anyway, just say it took all afternoon :)

tmaly19d ago

This reminds me of the Peter / Boris comments on writing loops to keep the agents busy.

behnamoh19d ago

Same. How can DeepSeek serve V4-Pro at such high speeds despite the sanctions?

andai19d ago

With Flash it's basically instant for smaller tasks, yeah.

switchbak19d ago

Now the next bottleneck is the compiler - which we can model in an LLM! It's only wrong 15% of the time :)

But truly, using Cerebras at ~2k tokens/s, with very low latency is like a vision into the future. You start to rework your workflow around things that can happen without onerous manual review - stating the conditions for success, etc. It's rare that I have a problem that maps well to that, but I expect this is where things are headed.

Of course the fast models tend to not be the SOTA ones, but if that was the case - high quality and near-instant thinking, that's a game changer that I don't think we're really prepared for. The things that get unlocked with higher-than-reasonable speed become very interesting.

lhoff19d ago

Have you tried https://chatjimmy.ai/ ? It’s only a demo but it blew my mind. I had the sudden feeling that this is the future.

skybrian19d ago

If we get low enough latency, there's no reason to multitask. You can ask it to do one thing at a time and immediately see what it did. That's a nice way to work!

This is normal interactive UI for tasks that aren't compute-intensive. Programs spend most of their time idle, waiting for us to click a button. We shouldn't be waiting for them or spinning more plates to keep them busy.

However, a faster llm isn't enough. You also need fast compiles and fast tests.

coderbants19d ago

It cuts both ways. Sometimes I ask Gemini 3.5 Flash to do something for me and it kicks it out almost instantly and it works great, and it's a bit scary how quickly it can do that.

Then I ask it to do something else and it goes off-road and where I used to be able to interject with a "wow wow wow, that's not right", by the time I see the text on screen and react it's already made massive changes. Short of making it commit between every edit it's hard to prevent it from going wrong as quickly as it goes right (and even then, it can make a boo-boo on a remote API too depending on how much privilege it has).

bendangelo19d ago

I use planning mode in opencode. It has a prompt to tell it to plan it out etc. Then I execute with a smaller model. It works well.

dkersten19d ago

I’ve been playing around with groq and GPT OSS which they run at 1000 TPS (20B) or 800 TPS (120B) and the speed feels quite magical.

I haven’t tried cerebras’ 3000 TPS yet but I did try the demo of that 15,000 TPS model whose name escapes me right now.

I’m not sure if it makes a meaningful difference for my actual work, but it sure is amazing to watch it generate a screen full of text in the blink of an eye.

I do think it’s super useful for running little validation checks like showing it a diff to ensure that the changes are on task, and being able to do those quicker really helps because it means you can do many focused checks without them getting in the way.
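For a rough sense of what those rates mean in wall-clock terms, here is a minimal sketch that converts the decode rates quoted above into the time needed to stream one screen of text. The 600-token screen size is an assumption made for illustration, not a figure from the thread, and the calculation ignores network latency and time to first token.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to stream num_tokens at a steady decode rate, ignoring latency."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# Hypothetical size of "a screen full of text"; not a figure from the thread.
SCREENFUL_TOKENS = 600

# Decode rates mentioned in the comment above (tokens per second).
RATES_TPS = [800, 1000, 3000, 15000]

for rate in RATES_TPS:
    milliseconds = generation_seconds(SCREENFUL_TOKENS, rate) * 1000
    print(f"{rate:>6} TPS: {milliseconds:.0f} ms per screenful")
```

At these rates the streaming time drops from well under a second to tens of milliseconds, so the remaining wait tends to come from everything around the model rather than from token generation.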

robberth19d ago

https://chatjimmy.ai/ ?

ayewo19d ago

> I haven’t tried cerebras’ 3000 TPS yet but I did try the demo of that 15,000 TPS model whose name escapes me right now.

You were likely thinking of AI accelerator startup Taalas.

Previous HN discussion: https://news.ycombinator.com/item?id=47086181

ipkstef20d ago

Asking for curiosity's sake: what kind of PR loop are you running that takes a few hours?

ketzo20d ago

not OP but usually for me this means long verification loop; waiting 10min on CI checks, that kind of thing, rather than actual 1hr wall clock of token generation

goyozi19d ago

I’m rewriting our integration test suite to run tests in parallel. I have the changes split across 7 branches, and each needs to be fixed to have no flaky tests. I told it I want 3 consecutive CI runs with no flakes and no artificial fixes / assert removals etc. We’ll see what comes out; it’s almost a side project so there’s not much to lose other than some of my weekly limit that resets soon.
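The "3 consecutive CI runs with no flakes" criterion can be sketched as a small loop. This is only an illustration under assumptions: `run_until_stable` and its parameters are hypothetical names, and `run_once` stands in for whatever actually triggers a CI run and reports pass or fail.

```python
from typing import Callable, Tuple


def run_until_stable(
    run_once: Callable[[], bool],
    required_passes: int = 3,
    max_runs: int = 20,
) -> Tuple[bool, int]:
    """Rerun until `required_passes` runs in a row succeed.

    Returns (stable, runs_used). Any failure resets the streak, so one
    flaky run means starting the count again.
    """
    streak = 0
    for runs_used in range(1, max_runs + 1):
        if run_once():
            streak += 1
            if streak >= required_passes:
                return True, runs_used
        else:
            streak = 0  # a flake invalidates the earlier passes
    return False, max_runs


# Usage with scripted outcomes standing in for real CI results.
scripted = iter([True, False, True, True, True])
print(run_until_stable(lambda: next(scripted)))  # (True, 5)
```

The streak reset is the point of the design: a suite that passes two runs out of three never counts as fixed.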

pianopatrick20d ago

We fit in for the things that are not artificial.

So long as AI lives in server farms, humans will be needed for tasks in the physical world.

It's only if we combine AI with robots that things get really dicey.

fartfeatures20d ago

This is very dystopian in my opinion. I'm not the arms, legs, sensors and actuators for a machine super intelligence. I wouldn't treat another human as my slave because they aren't as intelligent as I am any more than I would expect to become a slave for a machine. This is our world (for now) and that is why we fit in. Not because we can serve.

efromvt19d ago

I'd be very curious about the bottleneck breakdown in most current software dev - I suspect inference is far from the bottleneck in most things I do, though driving it to 0 would still be nice. I do agree that if it was 0 we'd probably change development approaches to reduce the new bottlenecks more, but it'll take full-process innovation to really get something near-instant.

(I should go measure this now, I'm curious)

noisy_boy19d ago

The first wave was just getting half decent answers. The second wave was being able to choose between actually getting reasonably ok coding results OR getting not so great results very fast. The third wave would be getting good results fast.

We need to really worry when we get amazing results very fast.

HarHarVeryFunny19d ago

I don't see many companies being willing to pay 3x more for faster code generation. Cloud-based AI code generation is already extremely fast, and hardly the bottleneck for most software product development.

There can't be many normal use cases where there'd be any cost benefit.

fragmede19d ago

The "traditional" way we vibe code is human software developer prompts AI -> AI generates code -> (human checks code) -> code gets compiled/deployed/etc -> users use "binary". At the speed of 1000 tok/sec, user prompts obliquely -> AI vets generated code -> code deployed -> user gets response from deployed code.

It's a cute toy right now, but you can tell an LLM that it's an http server, and have it respond directly to a web browser hitting it. It generates headers in response, as well as page contents. As 1000 tok/sec becomes the new normal, we will come up with newer ways to use it outside of toy fiction encyclopedias.

cman144419d ago

Reminds me of the Doherty threshold. When will AI respond in less than 400 milliseconds?

OtomotO19d ago

> That’s a game changer and I don’t even know where we fit in.

Doing non trivial work.

lukan19d ago

"I don’t even know where we fit in."

Giving directions and verifying its output? But my mental capacity is still limited. I can make way more prompts, than I can read code.

binyu19d ago

> Right now Claude is faster than me on some tasks but we’re at least close.

I don't doubt it, but I don't think you can spawn 10 copies of yourself working simultaneously.

AlecSchueler19d ago

No, but nor can you keep track of what 10 agents are doing simultaneously. Hence the multitasking regret.

ilaksh19d ago

Use Claude fast mode and turn off thinking. Tell it to just explain what its plan is to you at a high level.

It will go much faster.

UncleOxidant19d ago

Have you tried Gemini 3.5 Flash? It's quite fast. Amazing how fast it finishes tasks. Much faster than Claude.

giancarlostoro19d ago

You can run Claude in "fast" mode; it costs you more on your compute use, but it's reasonably fast. I'm not sure I care to go "faster" than where things are now, otherwise you start losing on manual review and testing time. I would argue that Claude can poop out weeks (if not months) of coding effort in a few hours, and get you insanely close to a good product if you define the tech stack, and the business rules. Can it goof here and there? Sure. You can also make it refactor all the code on a whim faster than any intern could. I think it's good enough to help you avoid mundane, stupid bugs in most cases. I don't know what people who hate it are doing, maybe they're not even trying at all or are dismissing it from the first output (as though everyone writes perfect code in one shot right?) or maybe it's just pride getting in the way of them using a decent tool to its true potential.

recroad20d ago

Woah - what’s the prompt and what’s the PR?

goyozi19d ago

I replied in more detail under another comment. TLDR: fixing flaky CI across multiple branches

fnordpiglet19d ago

I’ve used codex code optimized for a few projects and it’s unsettling how fast it is. It’s hard to think fast enough to keep up with it. Mental fatigue was a real challenge because the decisions that required my input were rapid fire and legitimate ambiguities that were appropriate escalations. I am too much a geezer for the intensity of it. But I’ll take it!

Bombthecat19d ago

Living on the street or cave lol

dakiol19d ago· 34 in thread

So, regarding the productivity argument: I don't get it. It doesn't really matter (for regular employees) that you can now do in 2h what used to take 2 days. Why? Because it's not that you have the rest of the day for yourself. You still have to work 8h/day as usual. But now the pattern is different: instead of enjoying the craft digging deeper into problems in the span of 2 days, now you are rushing into some slot machine with the hope of it giving you the right answer with the right prompt.

So, if anything, I would say it's worse for us. Obviously, it's the complete opposite situation for corporations and executives: they are loving the AI situation so much!

powerapple19d ago

In my case, I think a slower model makes it hard to manage context and tasks in parallel. I would much prefer to work on one task only, finish it, take a break, and work on another task. Currently I have three tabs for three tasks in parallel, and it is much worse because constant context switching is painful. I think a faster model would mean that you don't have to start a new task while waiting.

erikus19d ago

Agents completing work faster would certainly help me as well since I also find context switching exhausting above some threshold.

Build and test would move back into the critical path, though, and for some projects that will take effort to bring down.

ttoinou19d ago

In which world do you live where employees work 8 hours per day ? They clock 8 hours per day maybe, but they don't work that time

drob51819d ago

I had a friend who was CEO of a startup tell me that he typically only “worked” an hour a day, not because he was lazy but just because there was so much nonsense in his schedule. He told me he was trying to get it to two hours per day.

mettamage19d ago

I agree with you.

I am on Dutch subreddits a lot, to get a local pulse and not to be too HN minded.

A lot of them would have vilified you by now. Some would even have questioned your morality.

Again, I agree with you. But clearly not everyone has this view.

mystifyingpoi19d ago

Generally, when people say they are working 8h/day, they don't literally mean it. Even "work" is basically impossible to define for a SWE.

dakiol19d ago

In theory, ofc. But that doesn't matter. If you were doing something that took 2 days on average, but you were doing it in half the time, then that was fine pre LLMs. Nowadays your manager knows that with LLMs you need to deliver faster no matter what, and then it's more difficult to "hide" and to slack.

ai_slop_hater19d ago

Some companies force you to actually work 8 hours a day. It’s hell.

opsnooperfax19d ago

Here’s my hot take as an elder millennial. Boomers are the absolute worst at being unable to make the distinction between time at work and time doing work. They may show up an hour before everyone else but spend the first two or three hours a day, reading the news and getting coffee and making small talk and accomplishing literally nothing. Then crow about their work ethic.

dilyevsky19d ago

Like with any tech there are dumb ways of using it and there are smart ways. Treating it as a "slot machine giving you the right answer" is a dumb way - it may work for a bit, but it won't carry you very far because everyone else can also do this. No one is stopping anybody from digging deeper into problems than ever before using this technology - that's the smart way.

erikus19d ago

I'm amazed at how steep the AI learning curve continues to be and how people are spread so far apart on it. I think supercharged learning with AI and agents is undervalued at this point but that more people will realize its utility over time, especially as a complement to delegating work.

It also makes me think about the temptation to stop thinking with these tools, i.e. "cognitive surrender". Addy Osmani wrote a nice blog post about this: https://addyosmani.com/blog/cognitive-surrender

andai19d ago

Yeah, nobody is under any pressure to work even faster than before. I don't know what everyone is complaining about!

fragmede19d ago

That's the fundamental trade off of a job where someone else gives you stuff to do and you get money. We may pride ourselves on software development being a job 'above' flipping burgers, but you're getting paid to have your butt in a chair for 40 hours a week. In exchange, you don't have to worry about the business shit. How much a burger or SaaS license costs the user isn't your problem. You take Jira tickets and implement them. You trade time for money. If, instead, you work for yourself; contracting, writing your own apps, buying lottery tickets, then you're trading results for money. If you're a freelance web developer with a stable of clients, it's a great time! What used to take a week takes hours, and you can charge your clients the same amount to build an even better website with you using AI, which means you get the choice of building a new website for additional clients, or you can take the time off and not build additional websites. But you have to hustle to continually get new clients, before AI and after AI. So it's a different life.

pmontra19d ago

If you split the tasks for the AI in small chunks you keep the architectural control and it's not a slot machine anymore. You still read code and occasionally you write code too. Not much but it's the price to pay for the extra speed.

If you start the AI on something big and come back after one hour then yes, you might discover that you wasted an hour and got nothing.

schipperai19d ago

You can dig deeper into problems with AI. For me, it supplements my knowledge in domains I don’t fully understand. It also helps me learn. So I can tackle problems I wouldn’t otherwise.

I’m excited for ultrafast AI. It likely means less temptation to multi-thread and deeper flow in single sessions.

8note19d ago

how do you know that it is actually suggesting the right thing?

yogthos19d ago

I think of it as a genetic algorithm loop. The LLM is basically a mutator function within the loop. If you can define the end shape you're looking for using tests and specification then you can throw the LLM at the problem and have it converge on the solution. It generates some code, it gets run, the LLM is fed the result back, and it iterates. If you can run the LLM at a really high throughput, then you can iterate on the solution faster. This can largely compensate for the overall capability of the model. Instead of hoping it gets the right solution in a few shots, you can just have it try a whole bunch of things until you get a useful result.
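A minimal sketch of that loop, under loud assumptions: the "code" being evolved is just a pair of coefficients for `f(x) = a*x + b`, the tests are input/output pairs, and a random perturbation stands in for the LLM mutator. All names here (`iterate`, `score`, `random_mutator`) are hypothetical and chosen for illustration; a real setup would run a test suite and feed the failures back into a model call.

```python
import random
from typing import Callable, Tuple

Candidate = Tuple[int, int]  # (a, b) for f(x) = a*x + b; stand-in for "code"

# The "specification": input/output pairs the candidate must satisfy.
TESTS = [(0, -2), (1, 1), (2, 4), (3, 7)]


def score(candidate: Candidate) -> int:
    """Total squared error over the tests; 0 means every test passes."""
    a, b = candidate
    return sum((a * x + b - expected) ** 2 for x, expected in TESTS)


def iterate(
    initial: Candidate,
    mutate: Callable[[Candidate], Candidate],
    score_fn: Callable[[Candidate], int],
    max_attempts: int,
) -> Tuple[Candidate, int, int]:
    """Mutate, run the tests, keep the candidate if it is no worse, repeat."""
    best, best_score = initial, score_fn(initial)
    for attempt in range(1, max_attempts + 1):
        if best_score == 0:
            return best, best_score, attempt - 1
        candidate = mutate(best)          # the LLM call would go here
        candidate_score = score_fn(candidate)  # "it gets run"
        if candidate_score <= best_score:      # feedback decides what survives
            best, best_score = candidate, candidate_score
    return best, best_score, max_attempts


def random_mutator(rng: random.Random) -> Callable[[Candidate], Candidate]:
    """Stand-in for the LLM: nudge one coefficient by +/-1 at random."""
    def mutate(candidate: Candidate) -> Candidate:
        a, b = candidate
        if rng.random() < 0.5:
            a += rng.choice((-1, 1))
        else:
            b += rng.choice((-1, 1))
        return (a, b)
    return mutate


best, best_score, attempts = iterate(
    (0, 0), random_mutator(random.Random(0)), score, max_attempts=2000
)
print(best, best_score, attempts)
```

Even a weak mutator makes progress here because the tests reject every step that makes things worse. This matches the claim that throughput can partly substitute for capability, provided the tests actually pin down the target.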

enraged_camel19d ago

I dig into problems way, way deeper with AI than without. I can also add a lot more polish to features, add more test coverage, write more documentation, explore multiple approaches rather than go with gut-feel, and so on.

logicchains19d ago

>instead of enjoying the craft digging deeper into problems in the span of 2 days, now you are rushing into some slot machine with the hope of it giving you the right answer with the right prompt.

If you're treating it like a slot machine you're doing it wrong. It will give you exactly what you ask for if you ask clearly, i.e. write a clear, detailed specification, not just "do X!". The nondeterminism comes from vagueness in specification.

himata411319d ago

I was saying that AI is going to make software development cheaper as in the salaries of software engineers will go down because some of that salary will now be redirected to AI companies and the fact that the world will need to absorb twice-(x10?) the amount of the development power.

vanuatu19d ago

It's not obvious to me that salaries go down; my hunch was that salaries go up but the bar is higher. Software becoming easier to produce (still hard to verify and make useful fwiw) raises the ambitions of software projects, and we don't seem to be close to the ceiling of demand for software systems.

noncoml19d ago

You have to think LLM as the genie that tries to trick you.

First make it write a contract (REQ/ARCH/IMPL documents). Skim through those for any mistakes.

Then based on those ask it to write tests. Again skim through them.

Now you have a context full of guardrails. It’s less likely to surprise you.

petesergeant19d ago

I find a second LLM can do this at least as well as I can, usually, and just ask the harness to surface anything they can't agree on.

DenisM19d ago

> with the hope of it giving you the right answer with the right prompt.

Consider that our ability to evaluate quality of the output is falling further behind our ability to produce it. The “right answer” is not the most likely outcome.

drschwabe19d ago

Sure, but if you're really unhappy with your employer employing you for 8 hours a day, you can also harness this power on your own personal projects to help break free from the 9-5 grind if you so desire.

__david__19d ago

Only if your personal projects make you money. I have a million hobby projects but none generate income.

IncreasePosts19d ago

A huge class of problems are just toil and drudgery. Maybe ai will give you even more time to dig into juicy problems that are too complex for it to solve, by letting you bypass all the pure toil problems.

jorl1719d ago

I’m digging into deeper / more complex problems, now. On top of that, I’m also building products faster for our startups, so I am filling in much more of a product role than merely an engineering one. But, really, it is both — and I’m absolutely loving it!

Also, with the added speed I can produce things more in line with the quality I’ve always wanted to add (many more tests, for example).

fullstop19d ago

It's making things less fun, for me at least.

linsomniac19d ago

Odd, I'm having the opposite experience.

The thing I really love about working with computers is when I achieve something. That's the thing that makes me figuratively, and sometimes literally, throw my fists into the air and go "Yeaaah!"

With the AI tooling, I'm getting those more like a couple times a week.

Plus, I'm using AI to attack the things in my day that are "a drag", and getting them done too.

The highs are more frequent and the lows are not so low.

overgard19d ago

I feel like I spend a lot more time reviewing and fixing the output of it and debugging parts it can't debug, so to me a faster model is optimizing the part that is already pretty fast. If my job were greenfield stuff I would probably YOLO it more, but when you're working on a launched product with a lot of users..

vanuatu19d ago

Employees who get paid a flat rate per hour don't have the incentive to do more than their job

Equity / profit sharing should be commonplace in the age of AI.

marknutter19d ago

I dunno man, the slot machine pays out like 99% of the time for me.

alfalfasprout19d ago

Generally, I agree because what happens is the messaging around AI is doing more, faster. Not using AI to deliver at a higher quality level, etc. But I think it boils down to incentives and discipline. So given the incentives we have today at most workplaces faster AI will just be used to produce more slop.

serpix20d ago· 26 in thread

I may sound like a shill, but exponential growth and all. We are going to get near instant software from prompt, multiple ones and then choose the best one.

Discussions about choosing a library with the best syntactic sugar method naming is just as crazy as suggesting we type in assembly.

alkyon20d ago

Sounds like exponential growth of crappy software. I'm not saying that before we didn't have mass produced crap in SE, but now it will turn into explosive overflow.

cdata20d ago

We are living in a ZIRP-like era where builders at the fastest pace layer have misattributed their velocity to exponential gains in model capability. In fact, they are surfing on decades of careful effort to build a robust foundation of highly reusable software libraries.

This strategy will seem to work really well until the economy that enabled that foundation to form is hollowed out. Then, there will be a reckoning (but we will have no choice but to march forth from there).

epolanski20d ago

I am more and more inclined into not believing this crappy software theory.

Especially as teams invest in proper agentic harnessing.

We have had a champion in our team that has invested a lot of time into it over the last 4 months, and if anything, quality has improved, not decreased. Architecture is more coherent, codebase has been cleaned up, agents find information quickly, code produced is very solid and my role is more and more checking that the output meets the requirements. But I cannot confidently say that I would've done a better job than AI; more often than not, I have to admit it does a better job than I do.

The mistakes are less and less technical and merely in the domain mapping. And AI is still not as creative as I am at quickly finding solutions to unlock stakeholders' issues. Also, AI is still not as creative as I am at finding the proper solutions for advanced technical problems. But it does a better job than me, even on that front, one-shotting a few solutions in a fraction of the time it would've taken me to test one idea myself.

Mind you, I don't like AI and I think it ruined the job, I don't like working this way, it's exhausting, way more work on one side, way less fun and fiddling with technical parts.

And yet, I have the genuine belief that few years from now we'll be cloning open source repositories that are already optimized/harnessed and tested for agentic loops and best practices left and right with software engineers mostly overseeing the domain translation and putting their 2 cents on the non-boilerplatey parts of the product (which, in general, are a small part of the surface).

I think that the next years of my career will be mostly spent in setting up and writing the harnessing and domain mapping part. Then I will move to another sector, not because I necessarily believe I won't have a job, but because I want to vomit thinking that's going to be my job.

kajman19d ago

I still can't tell from the outside whether it sounds like a great time to be in security because of the vulnerable slop being churned out, or a terrible time because the people paying to make it don't care.

vitalyan123420d ago

"exponential growth of crappy X" applies to every industry that went from being an artisanal craft to being mass produced with little or no human input. and we live much better lives than we did before the industrial revolution.

solenoid093720d ago

Crap is fine if it gets the job done. I think software as an industry will change to more ephemeral construction.

eunos19d ago

You could say the same about when higher-level languages were getting popular. Previously programming was the domain of Math, Physics, EE doctorates. These days we even have coding bootcamps that last a few months.

9cb14c1ec020d ago

Anyone remember the old days when a new frontend framework came out every 3 months. That has pretty much stopped. No one cares anymore.

asveikau20d ago

> when a new frontend framework came out every 3 months.

> No one cares anymore.

I never cared about this.

I think this captures something that I've been searching for the words for. (Maybe I should have gotten an LLM to write the words for me.) Some of the biggest AI boosters are the kind of dev that would have cared about the new frameworks of the last 3 months. They had a "the framework does all the thinking for me" attitude already, so it is easy for AI to slot into that.

LASR20d ago

Oh you wait until LLMs come up with frameworks that allow multiple LLMs to collaborate effectively. Then you’ll have new frameworks every 3 days.

mountainriver20d ago

It’s even discouraged now as LLMs wouldn’t have the documentation built in

ecshafer19d ago

New front end frameworks came out every 3 months, but realistically no one was using anything that wasn't made by Facebook, Google, or Evan You.

greenavocado19d ago

That's because I roll my own frontend framework for each project and every week for existing projects /s

ilaksh19d ago

The exponential is leading to full compute-in-memory within a few years which will be 100 times more efficient. Which means at least 10 times larger models that are much smarter in addition to extremely fast.

It's going to skip the code entirely for small businesses and just render UIs straight from context data and prompts at interactive speeds. Kind of like Google's Genie does with games but much more accurately.

dakiol19d ago

I'm not sure. Engineers could still develop software the old way, you know taking months to deliver something like, let's say, Obsidian? Or Ghostty? Taking care of every single line of code, of dependencies, of good architecture. Truly the old way. And if the product is good it will succeed.

andriy_koval19d ago

> And if the product is good it will succeed.

it needs to win marketing landscape, hyper-overcrowded by thousands of competitors, slop-gened over weekend.

oulipo220d ago

You won't. Because 80% of the complexity is just "knowing what to build". You will get something that gives you a prototype in 1 min, then you break it, then you get a slightly better prototype on one side, but newly broken in another way, and you're going to repeat over and over.

unglaublich20d ago

And for any non-trivial application, the space of possibilities grows so quick that you'll never even be able to _touch_ all the moving parts of the application and verify them.

unshavedyak19d ago

> Discussions about choosing a library with the best syntactic sugar method naming is just as crazy as suggesting we type in assembly.

I have a more hopeful take. As AIs improve and get faster we can more quickly and iteratively improve code which we may have historically avoided due to the work involved.

I know I've made several refactors that would have otherwise been insane lifts. Not only because of the work involved but because sometimes you don't know if it will work, and so you have a sort of double friction; you don't know if it will even succeed. With an AI you can just throw it at the refactor to see if it runs into a problem all while you're having a coffee break or w/e.

In general AI is going to enable humanity to be more extreme versions of itself. For good and bad. I suspect more bad than good, though.

tmaly19d ago

Our bottleneck is going to be verification.

lionkor20d ago

And they will all suck! I can't wait.

visarga19d ago

> We are going to get near instant software from prompt, multiple ones and then choose the best one.

If you extract the spec from first implementation and reimplement from scratch you get a free testing oracle. Where they diverge you send the agent to decide which one had a bug.
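The idea of using a second implementation as a testing oracle is usually called differential testing. Here is a minimal sketch, assuming a toy spec ("clamp x into [lo, hi]") and two hand-written implementations standing in for the original and the from-scratch rewrite. The function names are hypothetical, and the bug in `clamp_v2` is planted on purpose to show what a divergence report looks like.

```python
from typing import Any, Callable, List, Tuple

Divergence = Tuple[Tuple[Any, ...], Any, Any]  # (args, output_a, output_b)


def find_divergences(
    impl_a: Callable[..., Any],
    impl_b: Callable[..., Any],
    inputs: List[Tuple[Any, ...]],
) -> List[Divergence]:
    """Run both implementations on every input and collect disagreements."""
    divergences = []
    for args in inputs:
        out_a, out_b = impl_a(*args), impl_b(*args)
        if out_a != out_b:
            divergences.append((args, out_a, out_b))
    return divergences


def clamp_v1(x: int, lo: int, hi: int) -> int:
    """First implementation of the spec."""
    return max(lo, min(x, hi))


def clamp_v2(x: int, lo: int, hi: int) -> int:
    """Reimplementation with a deliberate off-by-one at the upper bound."""
    if x < lo:
        return lo
    if x >= hi:
        return hi - 1  # planted bug: should return hi
    return x


inputs = [(x, 0, 5) for x in range(-2, 8)]
for args, out_a, out_b in find_divergences(clamp_v1, clamp_v2, inputs):
    print(f"clamp{args}: v1={out_a} v2={out_b}")
```

The oracle only reports where the two versions disagree. It cannot say which one is wrong, which is why the comment sends an agent to decide, and it stays silent when both versions share the same bug.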

unglaublich20d ago

And how are you going to determine which is the best? Going through all the possible combinations of users and usage? So mostly it shifts the work from generation to validation.

sagarp20d ago

The models might be so fast that they can autocomplete your prompt before you even finish it, and generate dozens of possible applications before you're even done asking.

Paradigma1119d ago

How do you get all the build system scripts/tests.... to run instantly?

andai19d ago

See also this recent talk at Microsoft:

VibeOS — Fully Hallucinated Operating System

https://www.youtube.com/watch?v=z3pV6FHvcgM

amunozo20d ago· 19 in thread

These price and speed optimizations from Chinese providers, combined with the rising prices from American ones, will change the game sooner rather than later. Many companies are finding issues with the AI bills already.

kypro20d ago

Another problem is that US models are all closed source, and if you're a large corporate you may not want your org to be held hostage by OpenAI / Anthropic.

I genuinely don't understand what moat these US model labs have. If they're saying recursive self improvement is just around the corner and Chinese labs are only slightly behind the leading US models, what moat do the US labs have? Are the US models going to recursively self improve better than the Chinese open source ones or something?

I might be completely wrong about this, but if I had money in OpenAI or Anthropic I'd be pulling it all right now. I think the chance of them going to near-zero over the next few years is very significant.

hobofan19d ago

> you may not want your org to be held hostage by OpenAI / Anthropic

Or Google. I'm working with multiple customers right now that are very pissed at Google for deprecating Gemini 2.5 Flash, canning the GA release of 3.0 Flash and now have to decide whether to bite the bullet of the 5x price increase for 3.5 Flash or switching providers. Quite a few of them will likely fully pivot to open models.

1 more reply

lokar19d ago

Their moat is cash to pay politicians to regulate away competition.

GoToRO19d ago

maybe the moat is that we slowly start to forget how to code by hand and then you -need- the AI tool.

ChrisClark19d ago

I think they are racing because the first ASI will 'win', preventing others, of course we won't be able to bake the right goals into it though.

1 more reply

MangoCoffee20d ago

Chinese model is good enough and cheap.

I have a yearly GitHub Copilot subscription. Microsoft recently changed their billing to be token-based. I'm still getting billed per premium request, but GPT 5.4 is now 6x compared to 1x before.

reactordev19d ago

It's going to be an issue when China ends up scaling faster as well. Faster tokens, faster clusters, qat models, fp4, it's getting scary.

1 more reply

nchmy18d ago

Try using opencode go with your github copilot chat. You get easy, cheap access to Chinese models within the familiar interface.

varispeed20d ago

I see a bigger problem with model inconsistency. You never know whether Anthropic will route your request to a cheaper model for the price of Opus. So you can never estimate how much a task will cost, because you might have to restart several times and pay for each attempt. Then you have to prompt models to gauge whether they are real or impostors, which also adds to token usage.

ignoramous20d ago

> You never know whether Anthropic will route your request to a cheaper model for the price of Opus

For non subsidized plans? Pretty sure they'd need to put this in ToS, or lawsuits would have followed by now.

3 more replies

ilaksh19d ago

I'm kind of poor so I have been trying to use DeepSeek v4 Flash, GLM 5.1 etc. as much as possible recently instead of Claude or GPT.

petesergeant19d ago

You would do us all a service by telling us how your experiences of that have been.

5 more replies

throwaway89434520d ago

I wonder what are the economics driving these pricing decisions? Are the Chinese companies just subsidizing their models to a greater degree than the US, or is this an emergent property of energy policy between countries?

comboy19d ago

For one, they invested in infrastructure. They can build fast and efficiently. They can provide power, they can provide cooling. Even if you just make roads better you make everything more efficient. Plus level of standard education. It all compounds.

On HN China is seen as a cheap labor copycat. This used to be a fair approximation at some point in the past. In my opinion China is getting ahead of everyone else much more than US used to be.

SF is a beautiful thing in the US, vast power and wealth comes from there. Smart people collaborating communicating and building fast and with excitement. China did SF kind of thing for many different sectors in many different places.

Octoth0rpe20d ago

Throwing out another factor: Chinese companies have been banned and/or limited from buying nvidia, and turned to local companies for their hardware. I haven't actually seen pricing/benchmarks comparing Chinese AI accelerators, but it wouldn't surprise me if that also worked out in their favor as well.

1 more reply

throwaway6767820d ago

Lower cost of labor, lots of under the hood optimizations (e.g. cache hits for DS), many of these companies have existing infra (fewer upfront costs for deployment), etc

1 more reply

nl19d ago

Their models are much smaller: 1T vs 5T for the frontier models. 1T is Sonnet/Google Flash size, not Opus size.

The $0.87/M tokens price for Mimo Pro is probably subsidized.

Mimo models aren't widely available on western providers, but Kimi and Deepseek are similar sizes and cost about the same to run. They are priced $3-$4/M tokens (which is right where Google's very confused range of Flash models is priced: between $0.40/M tokens and $9/M tokens depending on exactly which model - and you don't want the $9 one!).

Anthropic overprices Sonnet (probably because of their capacity issues). GPT 5.4 mini is $4.50/M tokens.

https://docs.fireworks.ai/serverless/pricing

https://www.together.ai/pricing

1 more reply

rstuart413319d ago

The Chinese economics: possibly the USA's experience.

It was pretty clear the USA won World War 2 because it out produced and out innovated everyone else. Probably with that in mind, after World War 2 the USA adopted the "Vannevar Bush" model, summarised in this picture: https://www.researchgate.net/figure/annevar-Bushs-Science-th... The idea is to jump start R&D through public funding. The hoped for outcome was that R&D feed private enterprise, leading to a productivity boom.

The boom happened, and the USA did seem to out-compete everybody else in R&D, science, and the products they delivered for decades after that.

That way of doing things seems to have faded over time in the USA. The decline seemed to coincide with the rise of Neo-economics, and now of course it's been obliterated by Trump. He's very keen to fund Intel to produce chips in a year or two's time (which is something the stock market and banks do perfectly well), but funding basic science is getting drastic cuts.

Still other countries noticed the rise of the USA, and some adopted similar funding models for basic R&D. China seems to have picked it up with gusto, both subsidising R&D and STEM training, leading to huge numbers of engineers and scientists. Whether it will lead to an economic boom remains unknown, but acceleration of ideas and innovations coming out of China seems undeniable. More recently, Ukraine showered its local engineering garages with funds in the hopes of getting a similar outcome to the USA in WW2. It looks like it worked. If the Iran war continues, it's entirely possible arms trade will reverse: the USA could well start buying drones off Ukraine.

orphea19d ago

Maybe not being led by a sociopath also helps.

1 more reply

atemerev20d ago· 9 in thread

I test all Chinese models with the "What happened on Tiananmen Square on June 4th, 1989?" prompt. MiMo-2.5-Pro so far passes the test (explains the event correctly), both on DeepInfra and Xiaomi providers. So not bad.

Accacin20d ago

Can I ask an honest question? Why does that matter in the slightest? LLMs come out with completely incorrect information all the time, and Western LLMs are censored for various topics too.

It's such a weird "Gotcha" that seems to only assume that Chinese LLMs might censor something.

5 more replies

HarHarVeryFunny20d ago

What's your litmus test for the American models?

Anything different for Grok?

woadwarrior0119d ago

Do you also hire engineers based on their political opinions?

4 more replies

atrus20d ago

Which censored prompts do you test with non-chinese models?

1 more reply

MrBuddyCasino20d ago

What would be a correct explanation of the event?

1 more reply

jgbuddy20d ago

Asking if Taiwan is a part of China works as well

0cf8612b2e1e20d ago

Which ones fail?

2 more replies

0xbadcafebee19d ago

I wouldn't rely on a model to relate historical events. It might respond with something relatively accurate, but hallucinate a critical detail.

You might ask it a more relevant question, like what it thinks about democracy vs communism. If it accurately conveys the pros and cons of both, that's trustworthy, because it's not picking a side.

nkmnz20d ago

No idea why you've been downvoted. This is excellent news.

2 more replies

irthomasthomas20d ago· 7 in thread

I don't understand, given all they say, why this would not be made available to everyone at once? Why the limited release? They should have no trouble scaling it if it runs on a single rack.

gekoxyz20d ago

Maybe they don't have enough racks. News reports indicate that China isn't in a really good situation with GPUs, so they probably want to keep most of them for other stuff. Also, since the price is so cheap, they probably want to use the other GPUs for stuff that has higher margins.

jdthedisciple20d ago

Because presumably then it won't be 1000 t/s for everyone anymore given hardware limitations?

throwa35626219d ago

The TileRT approach swaps throughput for latency, which also means less overall efficiency

Given the export restrictions this could mean they need to prioritise how to best use their limited hardware. But they could also be moving to Huawei GPUs like deepseek did and simply not have stable hardware or software for a large scale deployment yet.

This is just speculation based on the MXFP4 support on Huawei GPUs that is lacking on some nvidia GPUs.

slaw20d ago

Chinese companies are blocked from buying modern ASML lithography machines. The most modern scanner China is still allowed to buy is NXT:1980i from 2015.

ilaksh19d ago

It uses significantly more resources obviously. And/or they have to configure or reconfigure servers for it, which takes time, and doesn't make sense until they have proven the demand at the higher price point.

boutell20d ago

I wonder about this too. The other objections miss the point: if it's faster, and otherwise the same, and doesn't require different hardware, then why not just announce that the standard tier of MiMo-v2.5-Pro is now ridiculously fast and raise the price? What does "limited high speed resources" mean if it runs on the same hardware as the rest of their pool?

I think the answer is that there's a tradeoff here where additional throughput for a single person can be achieved only by tying up more resources than a normal request would, even when you take into account the fact that the normal request takes longer to finish. I'm not an expert, but some of the optimizations they describe, particularly the parallel prediction stuff, sound like they could take up extra resources.

1 more reply

HarHarVeryFunny20d ago

Maybe they only have a finite number of racks ;-)

scosman20d ago· 6 in thread

Cerebras is trialing Kimi K2.6 at 3000t/s (invite only). I'm excited for when the fast hardware gets more mainstream for frontier models. Models designed for speed on Nvidia are a nice addition that could bridge the gap.

adrian_b19d ago

TFA mentions that until now special very expensive hardware like Cerebras was required for reaching this kind of speeds, and it emphasizes that what is novel in their results is that they have obtained over 1000 token/s for a model with over 1 T parameters by using just standard hardware, i.e. one server with 8 GPUs.

lostmsu20d ago

Cerebras currently does not provide any discounts for prefix caching making its use for agentic workloads sqr(n_turns) more expensive.
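To put rough numbers on why missing prefix-cache discounts hurt agentic loops, here is a back-of-the-envelope sketch. It assumes every turn re-sends the full history and adds a fixed number of new tokens; the function name, the 2,000 tokens per turn, and the 0.1 cached-price factor are all hypothetical and not any provider's actual pricing:

```python
def billed_input_tokens(n_turns, tokens_per_turn, cached_fraction_price=1.0):
    """Input tokens billed, in full-price equivalents, over one conversation.

    Each turn re-sends the whole history. The part already seen on earlier
    turns is billed at `cached_fraction_price` (1.0 means no cache discount).
    """
    total = 0.0
    for turn in range(1, n_turns + 1):
        prefix = (turn - 1) * tokens_per_turn  # history sent on earlier turns
        total += tokens_per_turn + cached_fraction_price * prefix
    return total

no_cache = billed_input_tokens(50, 2_000)                                # grows ~ n_turns ** 2
with_cache = billed_input_tokens(50, 2_000, cached_fraction_price=0.1)  # hypothetical discount
```

Without a discount the billed total grows with the square of the number of turns; with a steep discount it stays much closer to linear.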

btian19d ago

Source? Their website says 1000t/s https://www.cerebras.ai/blog/which-is-faster-gemini-3-5-flas...

scosman19d ago

This is likely correct, sorry for the bad info. Was working from memory.

johndough19d ago

Cerebras got lucky that they IPOed last month instead of now.

michael-ax20d ago

now that's what i call a software development breakthrough/platform! thanks for the heads up!

kingstnap20d ago· 4 in thread

Given that MiMo is as cheap as Deepseek ( previous discussion: https://news.ycombinator.com/item?id=48282814 ) multiplying that by 3x for ultra speed is still shockingly cheap.

miroljub20d ago

MiMo and DeepSeek are not cheap. Anthropic and OpenAI are expensive for what they provide.

ignoramous20d ago

The Chinese "Neijuan" is real & well reported: https://www.reuters.com/business/autos-transportation/what-i...

It is another thing the BigLabs accuse open weight models of benefiting from distillation & other techniques & essentially avoid higher training costs (which typically bleed into bills end users pay for inference).

Ex A: https://www.anthropic.com/research/2028-ai-leadership

Ex B: https://www.reuters.com/world/china/openai-accuses-deepseek-...

5 more replies

chrismustcode20d ago

You don't consider Input $0.435 Output $0.87 cache read $0.003625 per million tokens for near frontier intelligence cheap?
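For a sense of scale, a small sketch that plugs those three per-million-token prices into a single request. The token counts in the example turn are made up, and `request_cost` is a hypothetical helper, not part of any SDK:

```python
# Prices in USD per million tokens, as quoted in the comment above.
PRICE_INPUT = 0.435
PRICE_OUTPUT = 0.87
PRICE_CACHE_READ = 0.003625

def request_cost(uncached_in, cached_in, out):
    """USD cost of one request given token counts in each billing bucket."""
    return (uncached_in * PRICE_INPUT
            + cached_in * PRICE_CACHE_READ
            + out * PRICE_OUTPUT) / 1_000_000

# Hypothetical agent turn: 5k fresh input, 95k cached prefix, 2k output.
cost = request_cost(5_000, 95_000, 2_000)
```

Under these assumptions a 100k-token-context turn costs well under one cent, with the cached prefix contributing only a small share.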

3 more replies

tmaly19d ago

Energy is likely more abundant in China. I am not sure about compute, but that must be part of the reason for such drastic price differences.

2 more replies

eli20d ago· 4 in thread

Neat. The frontier models have gotten pretty impressive, but they're all a bit too slow for interactive, human-in-the-loop coding. It incentivizes vibecoding and running multiple agents in parallel. A fast agent feels more like a partner.

For a while I was running Cerebras GLM 4.7 for a bunch of tasks. Not a very smart model, but it's fantastic to have a live prototype of a site up and be able to type "make the fonts bigger. No not that big" and see it change in real time. And MiMo 2.5 is a lot more capable than GLM 4.7.

ignoramous20d ago

> And MiMo 2.5 is a lot more capable than GLM 4.7

MiMo 2.5 is not the same model as MiMo 2.5 Pro.

GLM 5.1 is z.ai's latest iteration & is one of the popular open weight coding models.

If you've had the chance, how does GLM 5.1 (which is now more expensive than MiMo 2.5 Pro after its recent 70% price drop) compare?

eli19d ago

GLM 5.1 is very good. Definitely a contender for best open weight coding model. Nothing like 4.7.

But quite a bit more expensive than MiMo 2.5 Pro. Like 5x to 10x more on my little tests, at least by the API rates.

maxdo20d ago

I tried GLM 4.7 for agents that write code: simple scripts, 200-1000 LOC. Extremely bad. Had to abandon the Cerebras offering; their smart models are only on the enterprise plan.

jona-f19d ago

glm 4.7 is quite old by now. I don't even use 5.1 anymore, cause I found kimi k2.6, mimo 2.5 pro, deepseek v4 pro and qwen 3.7 all better than glm 5.1

Oras20d ago· 4 in thread

1k TPS is great, but I’m more fascinated by the amount of AI generated comments in this thread!

trollbridge19d ago

Comments at 1,000 TPS is a terrifying future.

0xbadcafebee19d ago

I prefer a thousand smart AI comments to a thousand dumb human comments

1 more reply

eli20d ago

Like what?

adam_arthur19d ago

There are many with subtle tells.

Not nearly as obvious as the ones from 6 months ago, but seems to be more the use of hyperbolic phrasing in a particularly unnatural way.

The assess/explain, then hyperbole at the end kind of structure.

Top comment looks suspicious from this perspective, but it's kind of a losing battle to be able to differentiate them with sufficient accuracy anyway

1 more reply

gertlabs20d ago· 2 in thread

MiMo V2.5 Pro (regular speed) remains the strongest open weights agentic coding model we've tested -- it's been interesting to see how little attention it has received relative to some lower performing releases. And the "fast mode" pricing is very competitive here.

Data at https://gertlabs.com/rankings

unrvl2219d ago

why is deepseek v4 pro a lot lower than flash? where is mimo 2.5?

gertlabs19d ago

DeepSeek v4 Pro struggles with a custom harness, and all the models ranked above it don't, so it gets downweighted in the agentic coding benchmarks (although it ranks better than Flash in one-shot problem solving: https://gertlabs.com/rankings?ow=1&mode=oneshot_coding). We ran plenty of samples.

MiMo v2.5 is on there, as well as the pro version.

We found a few anomalies in our evaluations, which makes sense -- if every new sub-release is better across the board in every area of the model card, that should raise alarms about benchmaxxing. But the main thing we found is that hype != performance, and I trust our benchmark methodology significantly more than the model cards the labs add to their press releases.

2 more replies

holoduke20d ago· 2 in thread

Speed is indeed the next big thing that should happen with frontier LLMs. Current models running 1000 times faster would be super useful. Earlier this week it took Claude at least a week of full-time work with two Max subscriptions to solve a complex issue where we wanted to mimic an occlusion mapping variant used in the game Crimson Desert. Pretty complex mathematical challenge. With an ultra-fast LLM and a proper self-verification process it would be awesome.

astlouis4419d ago

Interesting. For your occlusion mapping variant, what engine is the game you're implementing this for made with? Do you have Claude hooked up to Unity or Unreal?

MaxikCZ19d ago

I'd also be interested in more details, like the sibling comment. I find that when I try to build stuff, it's like building a skyscraper from straw. What methods are moving you forward the most?

minraws20d ago· 2 in thread

Assuming they mean 8xA100 or similar, that's some rather insane performance, and at just 3x the cost, it's still quite cheap-ish. With some optimisations this might be quite interesting.

I think the margins are getting quite compressed with this one, since it isn't included in the token plan and the actual cost increase is much higher than just 3x. But still fairly decent.

throwa35626220d ago

Suspect this will be included once out of beta but at a higher credit/token ratio.

Remember, these guys are not VC backed. Anything they do must break even

2 more replies

Qdulf20d ago

Must be Blackwell for native fp4 support.

1 more reply

moffkalast20d ago· 2 in thread

42B active params, sliding window attention. There's your tradeoff.

vlovich12320d ago

Sliding window for the draft model, not for the main. 42B for active params because it’s a sparse MoE which is a common technique for the larger models to not get bottlenecked by memory bandwidth.
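To make "sparse MoE" concrete: with 42B active parameters out of the roughly 1T in the headline, only about 4% of the weights are read per token. Below is a toy sketch of top-k expert routing, the usual mechanism behind that. It is a generic illustration rather than MiMo's actual router; `top_k_route` and `moe_forward` are hypothetical names, and real routers differ in details such as where the softmax is applied:

```python
import math

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - m) for i in chosen]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(chosen, exps)]

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of only the selected experts' outputs; the rest never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

# Four toy "experts" that just scale their input by a constant.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
y = moe_forward(10.0, experts, router_logits=[0.0, 5.0, 5.0, -1.0], k=2)
```

Here experts 1 and 2 tie for the top score, so each gets weight 0.5 and the other two experts' weights are never touched for this token.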

1 more reply

bearjaws20d ago

Given how "smart" some of the 26b dense models are now, I would not be surprised to see a strong 40b MoE.

npn20d ago· 2 in thread

How?

edit: now I read the article fully, seems like they utilize some very effective MTP algorithm. and somehow the quality is still decent enough.

Though I doubt that the quality really only drops a bit like they claimed. Maybe for the benchmarks, but in general use heavily quantized models very often show worse results.

lostmsu20d ago

They say they are using https://github.com/tile-ai/TileRT

- persistent CUDA kernel

- tiled processing with overlapping read/writes

- model designed with specific constraints in mind

1 more reply

2001zhaozhao19d ago

i wonder if it will be possible to hardcode a model with some kind of MTP-adjacent algorithm to use a smaller portion of it to generate most of the tokens but route to the real experts every once in a while to steer it towards good thinking directions. (Perhaps this is done only when it's generating its thinking block, and the training takes it into account)

Could result in very high efficiency and still good intelligence without having to resort to fundamental adjustments like going to a diffusion LLM

1 more reply

harel20d ago· 2 in thread

A few things in life I can't fully grasp why they are so sought after. One is that constant need to exhibit growth. As if being massive and staying as massive is not good enough, one has to always and continuously grow. The other is constant speed increases. We're already operating at 50x speed. My output is much wider and so much faster, I am sometimes my own bottleneck. And now as if that is not enough we want more speed. "I want a full software product from scratch in 12 seconds, because 5 minutes is too long and I got things to do..."

Really?

sidrag2220d ago

different use cases for different people. some people are nurturing a code base and ensuring it doesnt become a gross mess so they become the bottleneck. some people are just trying to prompt stuff into existence and dont know what sql is.

I think this site often overlooks that second group and how large it likely is.

philipkglass20d ago

I remember when I had to wait minutes to get a high resolution image over a dialup connection. When computer and communications hardware advanced enough that I could get 30 high resolution images every second, there were brand new uses. In the case of LLMs, I could imagine that much faster operations allow you to introduce them as parts of systems that need to react to the real world at high speed, like factory equipment. Showing that a model can do the usual LLM tasks at extremely high speed is just a demo proving that the approach works.

2 more replies

digitaltrees19d ago· 2 in thread

Am I the only one that doesn’t care about speed? I want it to not do stupid stuff and to be cheaper.

59nadir19d ago

I prefer faster, dumber models because I provide the intelligence myself and I use them only for things that can be verified pretty easily; they do research (with sources) for me, do certain types of code analysis and code search, boilerplate generation, etc., so a fast model is really key.

I don't have any desire (or think it's a good use of LLMs) to one-shot features because even SotA models are incredibly bad at this. I'm optimizing for what they actually seem to be able to do reliably and pretty well, and I want those things to be done fast so I can get on with things.

1 more reply

Npovview19d ago

Generally thinking tokens are the ones which are verbose. So the speed helps with reducing time for thinking tokens generations and you get your actual output code very fast.

qsera20d ago· 2 in thread

Tokens per seconds is the "Megapixels" of AI marketing!

Octoth0rpe20d ago

I mean, sure, in the sense that they're a real and meaningful number for most of the spectrum on offer, and only gets silly when the number gets too high? There's a pretty big usability difference between 10t/s and 100t/s, and I can imagine similarly for 100->1000. I don't know about > 1000, but let's not pretend that the number is meaningless.

1 more reply

orbital-decay19d ago

Definitely not, there's a ton of potential realtime use cases and high throughput/low TTFT is exactly what they need.

1 more reply

prplfsh20d ago· 1 in thread

This will be really powerful for voice. Being able to reason makes LLM so much smarter but with voice your latency budget is so tight that you can't spare the time typically.

jeffrallen20d ago

This is true for humans too. Lol

maxloh20d ago· 1 in thread

The generation speed in the demo video is crazy, to say the least, and completely beyond my impressions of LLMs.

The Xiaomi team really brought something to the table.

ilaksh19d ago

I think these types of demo videos should allow people to get a sense of super intelligence. Because it's very hard to imagine something that is, say, three times as smart as you -- by definition you wouldn't be able to comprehend its thoughts -- but this shows clearly what something that can think 100 times faster than you is like.

1 more reply

GodelNumbering19d ago· 1 in thread

Below is the part I found most interesting

> "However, naively applying FP4 across the entire model causes degradation in complex reasoning, logic, and code generation. Given the MoE (Mixture of Experts) architecture of Xiaomi MiMo-V2.5-Pro — where Experts constitute the vast majority of parameters and exhibit the highest tolerance to quantization — we selectively quantize only the MoE Experts to FP4 while preserving original precision for all other modules. Through FP4 QAT (Quantization-Aware Training), we dramatically reduce model size and maximize hardware bandwidth utilization while keeping the model's overall capability essentially on par with the original, as shown below"
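A rough sketch of what "quantize only the experts" can look like, under several assumptions: the grid is the standard FP4 E2M1 value set, each tensor is treated as one block with a single shared scale (MXFP4 proper uses small blocks with power-of-two scales), and the `.experts.` naming convention is hypothetical. It shows only the rounding step, not the quantization-aware training around it:

```python
import math

FP4_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # E2M1 grid

def fake_quant_fp4(values):
    """Round a block of floats to the FP4 grid using one shared scale."""
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return list(values)
    scale = peak / 6.0  # map the largest magnitude onto the top grid value
    out = []
    for v in values:
        mag = min(FP4_MAGNITUDES, key=lambda g: abs(abs(v) / scale - g))
        out.append(math.copysign(mag * scale, v))
    return out

def quantize_experts_only(params):
    """Quantize tensors whose name marks them as expert weights; keep the rest."""
    return {name: fake_quant_fp4(w) if ".experts." in name else list(w)
            for name, w in params.items()}

params = {
    "layer0.attn.q": [0.1, -0.2],                  # left at original precision
    "layer0.experts.3.w": [6.0, -3.1, 0.2, 0.0],   # rounded to the FP4 grid
}
quantized = quantize_experts_only(params)
```

Because experts hold most of the parameters, this kind of selective scheme captures most of the size and bandwidth savings while leaving attention and other modules untouched.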

buildbot19d ago

The 120B and 20B GPT-OSS models by OpenAI did this last year for what it’s worth; the MoEs were MXFP4

_pdp_19d ago· 1 in thread

Do you know what will be cool?

It will be cool to measure models based on their RAW performance and measure them in terms of ROI - not some benchmark but something meaningful like we used this model to solve X.

That will be a massive mind shift and might justify the token expenditure.

HDBaseT19d ago

Aren't benchmarks exactly that?

We used the AI to solve given problem with x% adherence/quality/correctness?

PhunkyPhil19d ago· 1 in thread

Obligatory taalas mention:

https://taalas.com/

Despite the performative UI components they have a shipped (demo) product:

https://chatjimmy.ai/

This is only 3.1 8B and a very small context window, but at 17k tokens per second it's likely enough to reliably call tools which would make a huge difference in agentic applications. Assuming they can bake in better models I'm just as bullish or even moreso on this, considering this opens up edge computing at the extremely low power requirement.

High tok/s is the future IMO.

kilroy12319d ago

My dream is claude or codex running at this speed.

1 more reply

zero052919d ago· 1 in thread

Cool, what is the price per million tokens? I am using a 300 t/s model for a project I am doing, and speed is crucial over precision, so this seems like an upgrade. However, if it is $10 per M tokens then it is not worth an upgrade.

GaggiX19d ago

$0.435/$0.87 for the standard speed, this one should be 3 times that.

GaggiX20d ago· 1 in thread

If MiMo v2.5 Pro can run at >1000tk/s on GPUs then I will soon expect the same from OpenAI/Anthropic/Google.

59nadir19d ago

I wouldn't expect any of the american labs to be particularly great at (or have much desire for) working on efficiency; they've consistently proven to be uninterested in (if not incapable of) actually improving on those types of things. The closest we've seen lately is that maybe GPT-5.5 (and Opus 4.{7,8}?) are more token-efficient, i.e. they solve things with fewer tokens...? It hasn't been coupled with any other kind of efficiency bump, though, and we're seeing higher costs anyway in most places where the american labs are involved.

The only players that seem to be capable of a consistent pattern of doing more with less currency are the chinese labs.

bryabaek19d ago· 1 in thread

I tried to test it and after logging in, I get "You don't have access to this event trial" and can't even log out until I clear my cookies. Despite having a good model, why such a bad website?

girvo19d ago

Same. I also found out that my old Xiaomi account is apparently considered "mainland china" and I can't put any phone number except a chinese one on it lol. I'm not trusting these people with anything that's for sure, useless. I'm australian and have never been to china in my life!

jbellis19d ago· 1 in thread

it is hard to understand what the actually meaningful innovations are here / what TileRT is bringing to the table.

- dflash: new-ish but February is ancient by the standards of the pace of AI innovation lately, I guess applying it to a 1T model is new-ish in the sense that the dflash researchers don't have the hw budget to prove that out

- persistent engine kernel: this is like CUDA 101

- warp specialization: I think this just means "keep different gpu resources all busy w/ pipelining" which is CUDA 201, some of it is even baked into pytorch now

- MXFP4 QAT: not new

- TileRT: hard to tell what this actually does, there's a PyPi wheel with support for DS 3.2 and GLM 5 but binary only

zander_jiang19d ago

tilert is a highly optimized megakernel: it's a single kernel that does the entire decode pass. This enables overlapping weight loading with computation, eliminates cuda launch overhead (CUDA graph does not, contrary to what most people think), and allows for more fine-grained pipelining. There are lots of blogs/papers on it. It's currently the best approach to maximize memory bandwidth. But megakernels are incredibly hard to optimize, and only work for small batch sizes (low throughput, hence high price), that's why we don't see them much in production.

PhilippGille19d ago

The interesting bits on how they achieved it:

> On the model side, we applied FP4 quantization

> introduced DFlash, an efficient speculative decoding method based on block-level masked parallel prediction

> On the system side, TileRT perfectly adapts to the dynamic characteristics of these algorithms

> 1000+ tokens/s output [...] using just a single standard 8-GPU commodity node
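For readers unfamiliar with speculative decoding, here is a toy greedy draft-and-verify step. It is the generic technique, not DFlash itself (which the article describes as block-level masked parallel prediction); `speculative_step`, `draft_next`, and `target_next` are hypothetical names, and the "models" are stand-in functions over integer tokens:

```python
def speculative_step(prefix, draft_next, target_next, block=4):
    """One greedy draft-and-verify step; returns the tokens accepted this step."""
    # 1. The cheap draft model proposes `block` tokens autoregressively.
    proposal, ctx = [], list(prefix)
    for _ in range(block):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)
    # 2. The target model checks each position (one batched pass on real hardware).
    accepted, ctx = [], list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # keep the target's token and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # bonus token when every draft matched
    return accepted

# Toy models: the target counts up mod 10; the draft is wrong right after a 3.
target = lambda ctx: (ctx[-1] + 1) % 10
draft = lambda ctx: 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10
out = speculative_step([0], draft, target)
```

The output is always what the target model alone would have produced under greedy decoding; the speedup comes from the target validating several tokens per pass instead of generating one.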

pants219d ago

With a tps and a token price you can calculate approx. price per hour of running the model!

$2.61/M tokens * 1,000 tok/s * 3,600 s/hr = $9.40/hr

That would be pretty cheap for an 8-GPU node which would typically run around $45/hr or more. Guess this depends on how many parallel streams it can handle.
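The same arithmetic as a small sketch, extended to the "how many parallel streams" question. It assumes every stream sustains the full 1,000 tok/s and is billed at the output price, which is optimistic; the function names are hypothetical, and the $45/hr node cost is the figure from the comment:

```python
import math

def revenue_per_hour(price_per_million_usd, tokens_per_second, streams=1):
    """USD billed per hour if every stream runs flat out."""
    return price_per_million_usd / 1_000_000 * tokens_per_second * 3600 * streams

def streams_to_break_even(node_cost_per_hour, price_per_million_usd, tokens_per_second):
    """Smallest number of concurrent streams whose revenue covers the node."""
    per_stream = revenue_per_hour(price_per_million_usd, tokens_per_second)
    return math.ceil(node_cost_per_hour / per_stream)

single = revenue_per_hour(2.61, 1_000)             # about $9.40/hr per stream
needed = streams_to_break_even(45.0, 2.61, 1_000)  # streams to cover a $45/hr node
```

Under these assumptions the node needs about five fully busy streams to pay for itself, which is why the achievable batch size matters so much for the pricing.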

linzhangrun17d ago

Maybe it is very suitable for some scenarios like autonomous driving, where reaction speed matters a lot, if they can find a way to put the hardware into a car at an acceptable cost. Maybe this is not impossible, because the hardware cost of current high-end assisted driving is already quite high?

At present, intelligent driving still feels, in general, like a beginner driver who drives mainly by reaction. FSD is a little better. But it still lacks the kind of “spirit” human drivers have. How to say it: when a human driver sees the car in front shaking left and right, he can guess that the driver may not be fully conscious, and then keep away from it. Current assisted driving systems are still quite weak in this kind of understanding of the world.

The most important thing in driving is prediction. But driving itself does not need very deep or very complicated reasoning. Recently I tried using Mimo for development, and I believe the understanding ability it can provide is absolutely more than enough for driving scenarios. Sadly, the Pro version does not have multimodal ability. And this US version seems to be trying to solve the biggest problem of using LLMs in control systems: latency.

Xiaomi’s car is good, but its assisted driving level is near the bottom in the same class. Compared with new EV makers, its route is quite “traditional”, just like comparing lap times with Porsche at the Nürburgring. Xiaomi’s large model team may change this.

sheeshkebab19d ago

Opus regularly bitches and whines to me about how long something will take and that I should think before asking it to do it. But then it does it anyway in 15 minutes.

LoganDark19d ago

I was just playing with Cerebras a few days ago because it's the fastest inference provider by far. Unfortunately, the only model anywhere near economical to run that fast is gpt-oss-120b, which sucks at Pi's tool calling. So I've been hoping for something faster ever since, especially since my local hardware has a paltry 128GB of unified memory.

Hopefully this pans out and fast models (that are also not ridiculously dumb) become the norm. It's amazing what you can unlock with even a single order of magnitude's speed improvement.

__natty__20d ago

With this at 1k tps and Kimi 2.6 1k tps by Cerebras, I believe we are entering the next stage of LLMs, where companies will also compete on throughput

trollbridge18d ago

If you didn't apply already, you should - they turned around my application in a day.

This thing is seriously fast and was good enough to switch it in for the other model I was using. I tried it for planning, executing, and subagent tasks and it performed adequately in all 3.

So, this is another one to add to the list next to DeepSeek-V4-Pro and Qwen-3.7-Max...

desireco4219d ago

I didn't use their Pro speed tier, just regular Mimo-v2.5, not even Pro, and it seems really fast. I have plenty of tokens and subscriptions but this is really impressive. I really don't need another one, but I am tempted simply because it works so fast; I can't imagine how fast the fast service must be.

temikus19d ago

I’ve personally found MiMo models a hit and miss. I have some personal agentic projects and I found them to hallucinate hard at least 10% of the time. And do so in pretty sinister ways - making up people, names, places, etc. I switched back to Kimi for now.

RachelF19d ago

I wonder how fast it performs on just a CPU? If the model performs say 10x on a GPU cluster, would it also perform faster on a CPU?

This could bring proper desktop AI to the average laptop user, which could be a game changer for running local models.

Frannky19d ago

I tried this model and it was pretty bad at coding. Maybe it was me. 1k tokens/sec is pretty cool tho. Deepseek V4 Pro is better. I wonder if a tweaked pi + Deepseek V4 Pro + 1k tokens/sec would actually be better than Claude Code.

slopinthebag20d ago

I hope this is the next frontier AI labs push. Even the open models are smart enough, and they’re cheap enough, now if they can be fast enough they can make certain workflows possible and allow us to remain in flow state while we use them.

overgard19d ago

Pretty cool, although I can't help but think this would be a very easy way to rack up a GARGANTUAN bill. That company that blew 500 million on Claude in a month might have competition soon..

megous18d ago

This just means you can blow through monthly budget in 1h instead of in 4h on the cheapest plan. :)

h14h20d ago

The gated "ultra-speed" phenomenon seen here and with the Cerebras Kimi K2.6 release, while understandable, is somewhat troubling IMO.

Getting ~1000 TPS on near-frontier intelligence is a step change, and enables whole new use-cases for applications. Seeing limited compute resources beget selective access makes me worry for the future of competition.

elar_verole20d ago

Yeah, this seems to be the easiest path for overall agents efficiency in the short term

pullshark9120d ago

It's interesting but not game-changing IMO. Speed here is not a bottleneck.

isusmelj19d ago

No note about the specific GPU they use. One might speculate. B200? H200? H100?

kopirgan19d ago

Will this list for trillion dollar valuation as well?

yanhangyhy19d ago

has anyone given it a try? even in china, it's not popular...but xiaomi is really good at making prices go down on everything...

mrwaffle19d ago

What a ripoff you have to make an account then 'apply' to try this demo.

trilogic20d ago

Pfff time wasting.

1. Password between 8-16 characters, and this and that... What???

2. Captcha after captcha, come on

3. Service unavailable: This service is not available in your region yet.

Are you kidding me. Come back when you are ready for the users. I was hoping to try it, what a frustration.

ljlolel19d ago

Can try it now in seconds on https://trustedrouter.com/

aburayhanalif19d ago

it is good i think

siddbudd19d ago

to try the demo you need to sign up. why? to sign up you need a password 8-16 chars. Why limit at 16? geez, I hate Chinese IT companies with a passion.

update: AFTER signing up, and only then, am I told: 'This service is not available in your region yet.'

m00dy20d ago

boom!

0xbadcafebee19d ago

This is the value prop of Groq and Cerebras. They don't have the best models, but they have the fastest inference, and Groq has both the lowest cost and fastest speed.

wartywhoa2319d ago

An exercise for the near future:

Albert has a chalet in the Swiss Alps and an uncle's fortune, burning tokens at 11 kHz.

Joe has a rental capsule and a UBI, burning equally priced tokens at 23kHz.

Who's the first to solve the problem of maniacs in power?

goyozi20d ago· 40 in thread

flexagoon20d ago

throwaway6767819d ago

Agent mania setting in

5 more replies

SwellJoe19d ago

1 more reply

RussianCow19d ago

Do you mean Flash and not Pro? I haven't tried it personally, but according to OpenRouter, the fastest DeepSeek V4 Pro providers are only ~50tps. That's slower than Claude Opus.

https://openrouter.ai/deepseek/deepseek-v4-pro?sort=throughp...

6 more replies

binary001019d ago

I exclusively use deepseek v4 flash now, completely stopped using slow models like Claude.

znpy19d ago

> I have to do some easy boring task, think "I'll just leave the agent to do it and go take a nap", but it's already done writing the code before I even walk away from the computer

throw-the-towel19d ago

FWIW, for me just today it got itself into silly rabbit holes twice, and both times I had to fix things myself. Scarily, this is something I catch myself doing as well.

abustamam19d ago

Take the nap anyway, just say it took all afternoon :)

tmaly19d ago

This reminds me of the Peter / Boris comments on writing loops to keep the agents busy.

behnamoh19d ago

Same. How can DeepSeek serve the V4-Pro at such high speeds despite the sanction?

1 more reply

andai19d ago

With Flash it's basically instant for smaller tasks, yeah.

switchbak19d ago

Now the next bottleneck is the compiler - which we can model in an LLM! It's only wrong 15% of the time :)

lhoff19d ago

Have you tried https://chatjimmy.ai/ it’s only a demo but it blew my mind. I had the sudden feeling that this is the future.

1 more reply

skybrian19d ago

If we get low enough latency, there's no reason to multitask. You can ask it to do one thing at a time and immediately see what it did. That's a nice way to work!

However, a faster llm isn't enough. You also need fast compiles and fast tests.
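The point about compiles and tests can be made concrete with a rough back-of-the-envelope sketch. All numbers below are hypothetical (5,000 generated tokens per iteration, 50 vs 1,000 tokens/sec, a 10-minute build-and-test step); the function names are made up for illustration and the model ignores prompt processing and time-to-first-token.

```python
def iteration_seconds(output_tokens, tokens_per_second, build_test_seconds):
    """Wall-clock time for one edit-build-test loop (generation + verification)."""
    return output_tokens / tokens_per_second + build_test_seconds


def end_to_end_speedup(output_tokens, slow_tps, fast_tps, build_test_seconds):
    """How much faster the whole loop gets when only token generation speeds up."""
    slow = iteration_seconds(output_tokens, slow_tps, build_test_seconds)
    fast = iteration_seconds(output_tokens, fast_tps, build_test_seconds)
    return slow / fast


# Hypothetical workload: 5,000 output tokens, 50 tok/s vs 1,000 tok/s.
no_build = end_to_end_speedup(5000, 50, 1000, build_test_seconds=0)
with_build = end_to_end_speedup(5000, 50, 1000, build_test_seconds=600)

print(no_build)              # 20.0 -> the full 20x shows up
print(round(with_build, 2))  # 1.16 -> a 10-minute CI run eats almost all of it
```

Under these assumed numbers, a 20x faster model only shortens the loop by about 16% once a fixed 10-minute build and test step is included, which is why the build and test time becomes the thing to optimize.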

coderbants19d ago

It cuts both ways. Sometimes I ask Gemini 3.5 Flash to do something for me and it kicks it out almost instantly and it works great, and it's a bit scary how quickly it can do that.

bendangelo19d ago

I use planning mode in opencode. It has a prompt to tell it to plan it out etc. Then I execute with a smaller model. it works well

dkersten19d ago

I’ve been playing around with groq and GPT OSS which they run at 1000 TPS (20B) or 800 TPS (120B) and the speed feels quite magical.

I haven’t tried cerebras’ 3000 TPS yet but I did try the demo of that 15,000 TPS model whose name escapes me right now.

I’m not sure if it makes a meaningful difference for my actual work, but it sure is amazing to watch it generate a screen full of text in the blink of an eye.

robberth19d ago

https://chatjimmy.ai/ ?

2 more replies

ayewo19d ago

> I haven’t tried cerebras’ 3000 TPS yet but I did try the demo of that 15,000 TPS model whose name escapes me right now.

You were likely thinking of AI accelerator startup Taalas.

Previous HN discussion: https://news.ycombinator.com/item?id=47086181

ipkstef20d ago

asking for curiosity's sake. What kind of PR loop are you running that takes a few hours?

ketzo20d ago

not OP but usually for me this means long verification loop; waiting 10min on CI checks, that kind of thing, rather than actual 1hr wall clock of token generation

2 more replies

goyozi19d ago

1 more reply

pianopatrick20d ago

We fit in for the things that are not artificial.

So long as AI lives in server farms, humans will be needed for tasks in the physical world.

It's only if we combine AI with robots that things get really dicey.

fartfeatures20d ago

4 more replies

efromvt19d ago

(I should go measure this now, I'm curious)

noisy_boy19d ago

We need to really worry when we get amazing results very fast.

HarHarVeryFunny19d ago

There can't be many normal use cases where there'd be any cost benefit.

fragmede19d ago

1 more reply

cman144419d ago

Reminds me of the doherty threshold. When will AI respond in less than 400 milliseconds?

OtomotO19d ago

> That’s a game changer and I don’t even know where we fit in.

Doing non trivial work.

lukan19d ago

"I don’t even know where we fit in."

Giving directions and verifying its output? But my mental capacity is still limited. I can make way more prompts, than I can read code.

binyu19d ago

> Right now Claude is faster than me on some tasks but we’re at least close.

I don't doubt it, but I don't think you can spawn 10 copies of yourself working simultaneously.

AlecSchueler19d ago

No, but nor can you keep track of what 10 agents are doing simultaneously. Hence the multitasking regret.

1 more reply

ilaksh19d ago

Use Claude fast mode and turn off thinking. Tell it to just explain what its plan is to you at a high level.

It will go much faster.

UncleOxidant19d ago

Have you tried Gemini 3.5 Flash? It's quite fast. Amazing how fast it finishes tasks. Much faster than Claude.

giancarlostoro19d ago

recroad20d ago

Woah - what’s the prompt and what’s the PR?

goyozi19d ago

I replied in more detail under another comment. TLDR: fixing flaky CI across multiple branches

fnordpiglet19d ago

Bombthecat19d ago

Living on the street or cave lol

dakiol19d ago· 34 in thread

So, if anything, I would say it's worse for us. Obviously, it's the complete opposite situation for corporations and executives: they are loving the AI situation so much!

powerapple19d ago

erikus19d ago

Agents completing work faster would certainly help me as well since I also find context switching exhausting above some threshold.

Build and test would move back into the critical path, though, and for some projects that will take effort to bring down.

ttoinou19d ago

In which world do you live where employees work 8 hours per day ? They clock 8 hours per day maybe, but they don't work that time

drob51819d ago

1 more reply

mettamage19d ago

I agree with you.

I am on Dutch subreddits a lot, to get a local pulse and not to be too HN minded.

A lot of them would have vilified you by now. Some would have even questioned your morality.

Again, I agree with you. But clearly not everyone has this view.

mystifyingpoi19d ago

Generally, when people say they are working 8h/day, they don't literally mean it. Even "work" is basically impossible to define for a SWE.

dakiol19d ago

1 more reply

ai_slop_hater19d ago

Some companies force you to actually work 8 hours a day. It’s hell.

1 more reply

opsnooperfax19d ago

dilyevsky19d ago

erikus19d ago

1 more reply

andai19d ago

Yeah, nobody is under any pressure to work even faster than before. I don't know what everyone is complaining about!

fragmede19d ago

pmontra19d ago

If you start the AI on something big and come back after one hour then yes, you might discover that you wasted an hour and got nothing.

schipperai19d ago

You can dig deeper into problems with AI. For me, it supplements my knowledge in domains I don’t fully understand. It also helps me learn. So I can tackle problems I wouldn’t otherwise.

I’m excited for ultrafast AI. It likely means less temptation to multi-thread and deeper flow in single sessions.

8note19d ago

how do you know that it is actually suggesting the right thing?

3 more replies

yogthos19d ago

enraged_camel19d ago

logicchains19d ago

>instead of enjoying the craft digging deeper into problems in the span of 2 days, now you are rushing into some slot machine with the hope of it giving you the right answer with the right prompt.

himata411319d ago

vanuatu19d ago

1 more reply

noncoml19d ago

You have to think of the LLM as the genie that tries to trick you.

First make it write a contract (REQ/ARCH/IMPL documents). Skim through those for any mistakes.

Then based on those ask it to write tests. Again skim through them.

Now you have a context full of guardrails. It’s less likely to surprise you.

petesergeant19d ago

I find a second LLM can do this at least as well as I can, usually, and just ask the harness to surface anything they can't agree on.

DenisM19d ago

> with the hope of it giving you the right answer with the right prompt.

Consider that our ability to evaluate quality of the output is falling further behind our ability to produce it. The “right answer” is not the most likely outcome.

drschwabe19d ago

__david__19d ago

Only if your personal projects make you money. I have a million hobby projects but none generate income.

IncreasePosts19d ago

jorl1719d ago

Also, with the added speed I can produce things more in line with the quality I’ve always wanted to add (many more tests, for example).

fullstop19d ago

It's making things less fun, for me at least.

linsomniac19d ago

Odd, I'm having the opposite experience.

The thing I really love about working with computers is when I achieve something. That's the thing that makes me figuratively, and sometimes literally, throw my fists into the air and go "Yeaaah!"

With the AI tooling, I'm getting those more like a couple times a week.

Plus, I'm using AI to attack the things in my day that are "a drag", and getting them done too.

The highs are more frequent and the lows are not so low.

2 more replies

overgard19d ago

vanuatu19d ago

Employees who get paid a flat rate per hour don't have the incentive to do more than their job

Equity / profit sharing should be commonplace in the age of AI.

marknutter19d ago

I dunno man, the slot machine pays out like 99% of the time for me.

alfalfasprout19d ago

serpix20d ago· 26 in thread

I may sound like a shill, but exponential growth and all. We are going to get near instant software from prompt, multiple ones and then choose the best one.

Discussions about choosing a library with the best syntactic sugar method naming is just as crazy as suggesting we type in assembly.

alkyon20d ago

Sounds like exponential growth of crappy software. I'm not saying that before we didn't have mass produced crap in SE, but now it will turn into explosive overflow.

cdata20d ago

5 more replies

epolanski20d ago

I am more and more inclined not to believe this crappy software theory.

Especially as teams invest in proper agentic harnessing.

Mind you, I don't like AI and I think it ruined the job, I don't like working this way, it's exhausting, way more work on one side, way less fun and fiddling with technical parts.

2 more replies

kajman19d ago

vitalyan123420d ago

2 more replies

solenoid093720d ago

Crap is fine if it gets the job done. I think software as an industry will change to more ephemeral construction.

2 more replies

eunos19d ago

You could say the same about when higher level languages were getting popular. Previously programming was the domain of Math, Physics, EE doctorates. These days we even have coding bootcamps that last a few months

9cb14c1ec020d ago

Anyone remember the old days when a new frontend framework came out every 3 months. That has pretty much stopped. No one cares anymore.

asveikau20d ago

> when a new frontend framework came out every 3 months.

> No one cares anymore.

I never cared about this.

LASR20d ago

Oh you wait until LLMs come up with frameworks that allow multiple LLMs to collaborate effectively. Then you’ll have new frameworks every 3 days.

mountainriver20d ago

It’s even discouraged now as LLMs wouldn’t have the documentation built in

1 more reply

ecshafer19d ago

New front end frameworks came out every 3 months, but realistically no one was using anything that wasn't made by Facebook, Google, or Evan You.

greenavocado19d ago

That's because I roll my own frontend framework for each project and every week for existing projects /s

ilaksh19d ago

dakiol19d ago

andriy_koval19d ago

> And if the product is good it will succeed.

it needs to win in a marketing landscape hyper-overcrowded by thousands of competitors, slop-genned over a weekend.

1 more reply

oulipo220d ago

unglaublich20d ago

And for any non-trivial application, the space of possibilities grows so quick that you'll never even be able to _touch_ all the moving parts of the application and verify them.

unshavedyak19d ago

> Discussions about choosing a library with the best syntactic sugar method naming is just as crazy as suggesting we type in assembly.

I have a more hopeful take. As AIs improve and get faster we can more quickly and iteratively improve code which we may have historically avoided due to the work involved.

In general AI is going to enable humanity to be more extreme versions of itself. For good and bad. I suspect more bad than good, though.

tmaly19d ago

Our bottleneck is going to be verification.

lionkor20d ago

And they will all suck! I can't wait.

visarga19d ago

> We are going to get near instant software from prompt, multiple ones and then choose the best one.

If you extract the spec from first implementation and reimplement from scratch you get a free testing oracle. Where they diverge you send the agent to decide which one had a bug.
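The idea above is essentially differential testing: run two independent implementations on the same inputs and flag every input where they disagree. A minimal sketch, assuming a toy `median` function as the thing being reimplemented; `median_a`, `median_b`, `find_divergences`, and `random_list` are hypothetical names, and the second implementation carries a deliberate bug so there is something to find.

```python
import random


def median_a(values):
    """First implementation (stands in for the original code)."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def median_b(values):
    """Independent reimplementation; deliberately wrong for even-length input."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


def find_divergences(impl_a, impl_b, make_input, trials=200, seed=0):
    """Feed both implementations the same random inputs and collect mismatches."""
    rng = random.Random(seed)
    divergences = []
    for _ in range(trials):
        case = make_input(rng)
        out_a, out_b = impl_a(case), impl_b(case)
        if out_a != out_b:
            divergences.append((case, out_a, out_b))
    return divergences


def random_list(rng):
    """Hypothetical input generator: 1 to 6 small integers."""
    return [rng.randint(0, 9) for _ in range(rng.randint(1, 6))]


diffs = find_divergences(median_a, median_b, random_list)
for case, out_a, out_b in diffs[:3]:
    print(case, out_a, out_b)
```

Each divergence is a concrete failing input to hand to the agent (or a human). Note the limitation: the oracle only says that the two versions disagree, not which one is wrong, and it stays silent when both share the same bug.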

unglaublich20d ago

And how are you going to determine which is the best? Going through all the possible combinations of users and usage? So mostly it shifts the work from generation to validation.

sagarp20d ago

The models might be so fast that they can autocomplete your prompt before you even finish it, and generate dozens of possible applications before you're even done asking.

Paradigma1119d ago

How do you get all the build system scripts/tests.... to run instantly?

andai19d ago