It's also pretty funny sometimes how it gives weird future roadmap estimates ("part 2 - 3 weeks, part 3 - 2 months", etc.) and when you tell it to actually do those changes it's pretty much done in half an hour
https://openrouter.ai/deepseek/deepseek-v4-pro?sort=throughp...
Basically I never have to wait - yes I have to tell it little corrections occasionally (but I know the domain really well so that's not an issue), but it's so much faster than anything else it's kinda crazy. I love the super fast speeds with high involvement development cycle.
I actually enjoy using agentic development flows for the first time now - whereas with Claude I absolutely hated it. That 5 to 20 min wait after every prompt absolutely killed my desire to even want to work at all.
The way software engineering works these days reminds me a lot of factory workers who just sit in front of a production line all day and take out faulty items and/or perform a single step in the production of goods.
But truly, using Cerebras at ~2k tokens/s, with very low latency is like a vision into the future. You start to rework your workflow around things that can happen without onerous manual review - stating the conditions for success, etc. It's rare that I have a problem that maps well to that, but I expect this is where things are headed.
Of course the fast models tend to not be the SOTA ones, but if that was the case - high quality and near-instant thinking, that's a game changer that I don't think we're really prepared for. The things that get unlocked with higher-than-reasonable speed become very interesting.
This is normal interactive UI for tasks that aren't compute-intensive. Programs spend most of their time idle, waiting for us to click a button. We shouldn't be waiting for them or spinning more plates to keep them busy.
However, a faster llm isn't enough. You also need fast compiles and fast tests.
Then I ask it to do something else and it goes off-road and where I used to be able to interject with a "wow wow wow, that's not right", by the time I see the text on screen and react it's already made massive changes. Short of making it commit between every edit it's hard to prevent it from going wrong as quickly as it goes right (and even then, it can make a boo-boo on a remote API too depending on how much privilege it has).
I haven’t tried cerebras’ 3000 TPS yet but I did try the demo of that 15,000 TPS model whose name escapes me right now.
I’m not sure if it makes a meaningful difference for my actual work, but it sure is amazing to watch it generate a screen full of text in the blink of an eye.
I do think it’s super useful for running little validation checks like showing it a diff to ensure that the changes are on task, and being able to do those quicker really helps because it means you can do many focused checks without them getting in the way.
You were likely thinking of AI accelerator startup Taalas.
Previous HN discussion: https://news.ycombinator.com/item?id=47086181
So long as AI lives in server farms, humans will be needed for tasks in the physical world.
It's only if we combine AI with robots that things get really dicey.
(I should go measure this now, I'm curious)
We need to really worry when we get amazing results very fast.
There can't be many normal use cases where there'd be any cost benefit.
It's a cute toy right now, but you can tell an LLM that it's an http server, and have it respond directly to a web browser hitting it. It generates headers in response, as well as page contents. As 1000 tok/sec becomes the new normal, we will come up with newer ways to use it outside of toy fiction encyclopedias.
Doing non trivial work.
Giving directions and verifying its output? But my mental capacity is still limited. I can make way more prompts, than I can read code.
I don't doubt it, but I don't think you can spawn 10 copies of yourself working simultaneously.
It will go much faster.
So, if any, I would say it's worse for us. Obviously, it's the completely opposite situation for corporations and executives: they are loving the AI situation so much!
Build and test would move back into the critical path, though, and for some projects that will take effort to bring down.
I am on Dutch subreddits a lot, to get a local pulse and not to be too HN minded.
A lot of them would have vilified you by now. Some would even have questioned your morality.
Again, I agree with you. But clearly not everyone has this view.
It also makes me think about the temptation to stop thinking with these tools, i.e. "cognitive surrender". Addy Osmani wrote a nice blog post about this: https://addyosmani.com/blog/cognitive-surrender
If you start the AI on something big and come back after one hour then yes, you might discover that you wasted an hour and got nothing.
I’m excited for ultrafast AI. It likely means less temptation to multi-thread and deeper flow in single sessions.
If you're treating it like a slot machine you're doing it wrong. It will give you exactly what you ask for if you ask clearly, i.e. write a clear, detailed specification, not just "do X!". The nondeterminism comes from vagueness in specification.
First make it write a contract (REQ/ARCH/IMPL documents). Skim through those for any mistakes.
Then based on those ask it to write tests. Again skim through them.
Now you have a context full of guardrails. It’s less likely to surprise you.
Consider that our ability to evaluate quality of the output is falling further behind our ability to produce it. The “right answer” is not the most likely outcome.
Also, with the added speed I can produce things more in line with the quality I’ve always wanted to add (many more tests, for example).
The thing I really love about working with computers is when I achieve something. That's the thing that makes me figuratively, and sometimes literally, throw my fists into the air and go "Yeaaah!"
With the AI tooling, I'm getting those more like a couple times a week.
Plus, I'm using AI to attack the things in my day that are "a drag", and getting them done too.
The highs are more frequent and the lows are not so low.
Equity / profit sharing should be commonplace in the age of AI.
Discussions about choosing a library with the best syntactic sugar method naming is just as crazy as suggesting we type in assembly.
This strategy will seem to work really well until the economy that enabled that foundation to form is hollowed out. Then, there will be a reckoning (but we will have no choice but to march forth from there).
Especially as teams invest in proper agentic harnessing.
We have had a champion in our team that has invested a lot of time into it over the last 4 months, and if anything, quality has improved, not decreased. Architecture is more coherent, codebase has been cleaned up, agents find information quickly, code produced is very solid and my role is more and more checking that the output meets the requirements. But I cannot confidently say that I would've done a better job than AI; more often than not, I have to admit it does a better job than I do.
The mistakes are less and less technical and merely in the domain mapping. And AI is still not as creative as I am at quickly finding solutions to unlock stakeholders' issues. Also, AI is still not as creative as I am at finding the proper solutions for advanced technical problems. But it does a better job than me, even on that front, one-shotting a few solutions in a fraction of the time it would've taken me to test one idea myself.
Mind you, I don't like AI and I think it ruined the job, I don't like working this way, it's exhausting, way more work on one side, way less fun and fiddling with technical parts.
And yet, I have the genuine belief that a few years from now we'll be cloning open source repositories that are already optimized/harnessed and tested for agentic loops and best practices left and right, with software engineers mostly overseeing the domain translation and putting their 2 cents on the non-boilerplatey parts of the product (which, in general, are a small part of the surface).
I think that the next years of my career will be mostly spent in setting up and writing the harnessing and domain mapping part. Then I will move to another sector, not because I necessarily believe I won't have a job, but because I want to vomit thinking that's going to be my job.
> No one cares anymore.
I never cared about this.
I think this captures something that I've been searching for the words for. (Maybe I should have gotten an LLM to write the words for me.) Some of the biggest AI boosters are the kind of dev that would have cared about the new frameworks of the last 3 months. They had a "the framework does all the thinking for me" attitude already, so it is easy for AI to slot into that.
It's going to skip the code entirely for small businesses and just render UIs straight from context data and prompts at interactive speeds. Kind of like Google's Genie does with games but much more accurately.
It needs to win a marketing landscape hyper-overcrowded by thousands of competitors, slop-gened over a weekend.
I have a more hopeful take. As AIs improve and get faster we can more quickly and iteratively improve code which we may have historically avoided due to the work involved.
I know I've made several refactors that would have otherwise been insane lifts. Not only because of the work involved but because sometimes you don't know if it will work, and so you have a sort of double friction; you don't know if it will even succeed. With an AI you can just throw it at the refactor to see if it runs into a problem, all while you're having a coffee break or w/e.
In general AI is going to enable humanity to be more extreme versions of itself. For good and bad. I suspect more bad than good, though.
If you extract the spec from first implementation and reimplement from scratch you get a free testing oracle. Where they diverge you send the agent to decide which one had a bug.
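A minimal sketch of that oracle idea, assuming two independently written implementations of the same spec. The `slugify_v1`/`slugify_v2` functions and the helper names are hypothetical, made up purely for illustration; the point is only that any input where the two disagree marks a place where at least one of them (or the spec) has a bug.

```python
import random

def slugify_v1(text: str) -> str:
    """Hypothetical first implementation: splits on any run of whitespace."""
    return "-".join(text.lower().split())

def slugify_v2(text: str) -> str:
    """Hypothetical reimplementation from the extracted spec: replaces each space."""
    return text.strip().lower().replace(" ", "-")

def find_divergences(impl_a, impl_b, inputs):
    """Return (input, out_a, out_b) for every input where the two disagree."""
    divergences = []
    for value in inputs:
        out_a, out_b = impl_a(value), impl_b(value)
        if out_a != out_b:
            divergences.append((value, out_a, out_b))
    return divergences

def random_inputs(count: int, seed: int = 0):
    """Generate short random strings over a tiny alphabet to provoke edge cases."""
    rng = random.Random(seed)
    return [
        "".join(rng.choice("ab ") for _ in range(rng.randint(0, 8)))
        for _ in range(count)
    ]

if __name__ == "__main__":
    # Each divergence is a question for the agent (or you): which side is wrong?
    for value, out_a, out_b in find_divergences(slugify_v1, slugify_v2, random_inputs(200)):
        print(repr(value), "->", repr(out_a), "vs", repr(out_b))
```

Here the two versions agree on `"a b"` but split on `"a  b"` (double space), which is exactly the kind of ambiguity the spec never settled.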
VibeOS — Fully Hallucinated Operating System
I genuinely don't understand what moat these US model labs have. If they're saying recursive self improvement is just around the corner and Chinese labs are only slightly behind the leading US models, what moat do the US labs have? Are the US models going to recursively self improve better than the Chinese open source ones or something?
I might be completely wrong about this, but if I had money in OpenAI or Anthropic I'd be pulling it all right now. I think the chance of them going to near-zero over the next few years is very significant.
Or Google. I'm working with multiple customers right now that are very pissed at Google for deprecating Gemini 2.5 Flash, canning the GA release of 3.0 Flash and now have to decide whether to bite the bullet of the 5x price increase for 3.5 Flash or switching providers. Quite a few of them will likely fully pivot to open models.
I have a yearly GitHub Copilot subscription. Microsoft recently changed their billing to be token-based. I'm still getting billed per premium request, but GPT 5.4 is now 6x compared to 1x before.
For non-subsidized plans? Pretty sure they'd need to put this in the ToS, or lawsuits would have followed by now.
On HN China is seen as a cheap labor copycat. This used to be a fair approximation at some point in the past. In my opinion China is getting ahead of everyone else much more than US used to be.
SF is a beautiful thing in the US, vast power and wealth comes from there. Smart people collaborating communicating and building fast and with excitement. China did SF kind of thing for many different sectors in many different places.
The $0.87/M tokens price for Mimo Pro is probably subsidized.
Mimo models aren't widely available on western providers, but Kimi and Deepseek are similar sizes and cost about the same to run. They are priced $3-$4/M tokens (which is right where Google's very confused range of Flash models is priced: between $0.40/M tokens and $9/M tokens depending on exactly which model - and you don't want the $9 one!).
Anthropic overprices Sonnet (probably because of their capacity issues). GPT 5.4 mini is $4.50/M tokens.
It was pretty clear the USA won World War 2 because it out-produced and out-innovated everyone else. Probably with that in mind, after World War 2 the USA adopted the "Vannevar Bush" model, summarised in this picture: https://www.researchgate.net/figure/annevar-Bushs-Science-th... The idea is to jump-start R&D through public funding. The hoped-for outcome was that R&D would feed private enterprise, leading to a productivity boom.
The boom happened, and the USA did seem to out-compete everybody else in R&D, science, and the products they delivered for decades after that.
That way of doing things seems to have faded over time in the USA. The decline seemed to coincide with the rise of Neo-economics, and now of course it's been obliterated by Trump. He's very keen to fund Intel to produce chips in a year or two's time (which is something the stock market and banks do perfectly well), but funding for basic science is getting drastic cuts.
Still other countries noticed the rise of the USA, and some adopted similar funding models for basic R&D. China seems to have picked it up with gusto, both subsidising R&D and STEM training, leading to huge numbers of engineers and scientists. Whether it will lead to an economic boom remains unknown, but acceleration of ideas and innovations coming out of China seems undeniable. More recently, Ukraine showered its local engineering garages with funds in the hopes of getting a similar outcome to the USA in WW2. It looks like it worked. If the Iran war continues, it's entirely possible arms trade will reverse: the USA could well start buying drones off Ukraine.
It's such a weird "Gotcha" that seems to only assume that Chinese LLMs might censor something.
Anything different for Grok?
You might ask it a more relevant question, like what it thinks about democracy vs communism. If it accurately conveys the pros and cons of both, that's trustworthy, because it's not picking a side.
Given the export restrictions this could mean they need to prioritise how to best use their limited hardware. But they could also be moving to Huawei GPUs like deepseek did and simply not have stable hardware or software for a large scale deployment yet.
This is just speculation based on the MXFP4 support on Huawei GPUs that is lacking on some nvidia GPUs.
I think the answer is that there's a tradeoff here where additional throughput for a single person can be achieved only by tying up more resources than a normal request would, even when you take into account the fact that the normal request takes longer to finish. I'm not an expert, but some of the optimizations they describe, particularly the parallel prediction stuff, sound like they could take up extra resources.
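To make the "extra resources" point concrete, here is a toy draft-and-verify loop. This is generic greedy speculative decoding, not DFlash itself (whose details aren't spelled out here), and the two "models" are hypothetical stand-in functions. A cheap draft guesses a block of tokens, the target checks them, and every rejected guess is compute that bought nothing. In a real system the verification is one batched forward pass; the sketch just counts it as one pass.

```python
def greedy_decode(target_next, prompt, max_new_tokens):
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens

def speculative_decode(target_next, draft_next, prompt, max_new_tokens, block_size=4):
    """Return (tokens, target_passes, wasted_draft_tokens)."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    target_passes = 0
    wasted = 0
    while len(tokens) < limit:
        # The draft model proposes a whole block of tokens autoregressively.
        proposal = []
        for _ in range(block_size):
            proposal.append(draft_next(tokens + proposal))
        # The target verifies the block (simulated here position by position).
        target_passes += 1
        matched = 0
        for guess in proposal:
            if guess != target_next(tokens + proposal[:matched]):
                break
            matched += 1
        # The target's own token replaces the first wrong guess,
        # or is a bonus token when the whole block was accepted.
        correction = target_next(tokens + proposal[:matched])
        tokens.extend(proposal[:matched] + [correction])
        wasted += len(proposal) - matched
    return tokens[:limit], target_passes, wasted

def toy_target(seq):
    """Hypothetical 'big model': counts upward modulo 10."""
    return (seq[-1] + 1) % 10

def toy_draft(seq):
    """Hypothetical 'small model': same rule, but wrong right after a 7."""
    return 0 if seq[-1] == 7 else (seq[-1] + 1) % 10

if __name__ == "__main__":
    print(speculative_decode(toy_target, toy_draft, [0], 8))
```

The output matches plain greedy decoding token for token; what changes is that fewer sequential target passes are needed, at the price of draft work that sometimes gets thrown away.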
Another thing: the BigLabs accuse open-weight models of benefiting from distillation & other techniques, and so essentially avoiding higher training costs (which typically bleed into the bills end users pay for inference).
Ex A: https://www.anthropic.com/research/2028-ai-leadership
Ex B: https://www.reuters.com/world/china/openai-accuses-deepseek-...
For a while I was running Cerebras GLM 4.7 for a bunch of tasks. Not a very smart model, but it's fantastic to have a live prototype of a site up and be able to type "make the fonts bigger. No not that big" and see it change in real time. And MiMo 2.5 is a lot more capable than GLM 4.7.
MiMo 2.5 is not the same model as MiMo 2.5 Pro.
GLM 5.1 is z.ai's latest iteration & is one of the popular open weight coding models.
If you've had the chance, how does GLM 5.1 (which is now more expensive than MiMo 2.5 Pro after its recent 70% price drop) compare?
But quite a bit more expensive than MiMo 2.5 Pro. Like 5x to 10x more on my little tests, at least by the API rates.
Not nearly as obvious as the ones from 6 months ago, but seems to be more the use of hyperbolic phrasing in a particularly unnatural way.
The assess/explain, then hyperbole at the end kind of structure.
Top comment looks suspicious from this perspective, but it's kind of a losing battle to be able to differentiate them with sufficient accuracy anyway
Data at https://gertlabs.com/rankings
MiMo v2.5 is on there, as well as the pro version.
We found a few anomalies in our evaluations, which makes sense -- if every new sub-release is better across the board in every area of the model card, that should raise alarms about benchmaxxing. But the main thing we found is that hype != performance, and I trust our benchmark methodology significantly more than the model cards the labs add to their press releases.
I think the margins are getting quite compressed with this one, since it isn't included in token plan and the actual costs increase are much higher than just 3x. But still fairly decent.
Remember, these guys are not VC backed. Anything they do must break even
edit: now I read the article fully, seems like they utilize some very effective MTP algorithm. and somehow the quality is still decent enough.
though, I doubt that the quality really only drops a bit like they claimed. maybe for the benchmarks, but for general use the heavily quantized models very often show worse results.
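For a feel of how coarse 4-bit floats are, here is a toy block-scaled quantizer in the style of MXFP4, applied only to weights whose name marks them as experts. The value grid is the standard E2M1 set and the block size of 32 follows the MX convention, but the scale-selection rule is a simplification of mine, the `"expert"` name check and the dict-of-lists "model" are hypothetical, and none of this is claimed to be Xiaomi's actual pipeline (there is no QAT here at all, just rounding).

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
FP4_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def quantize_block(values):
    """Round one block to sign * FP4 magnitude * a shared power-of-two scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return [0.0 for _ in values]
    # Simplified rule: smallest power of two that keeps the largest value <= 6 * scale.
    scale = 2.0 ** math.ceil(math.log2(max_abs / FP4_MAGNITUDES[-1]))
    quantized = []
    for v in values:
        nearest = min(FP4_MAGNITUDES, key=lambda m: abs(abs(v) / scale - m))
        quantized.append(math.copysign(nearest * scale, v))
    return quantized

def quantize_experts_only(model, block_size=32):
    """model maps parameter name -> list of floats; only 'expert' entries are rounded."""
    result = {}
    for name, weights in model.items():
        if "expert" in name:  # hypothetical naming convention
            rounded = []
            for start in range(0, len(weights), block_size):
                rounded.extend(quantize_block(weights[start:start + block_size]))
            result[name] = rounded
        else:
            result[name] = list(weights)  # kept at original precision
    return result

if __name__ == "__main__":
    toy_model = {"attention.q": [0.1234, -0.9], "expert.0.w": [0.4, 6.0, -3.0, 1.2]}
    print(quantize_experts_only(toy_model))
```

With only eight magnitudes per block, a weight of 0.4 next to a 6.0 lands on 0.5, which is the kind of error that has to be trained around if quality is going to hold up outside benchmarks.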
- persistent CUDA kernel
- tiled processing with overlapping read/writes
- model designed with specific constraints in mind
Could result in very high efficiency and still good intelligence without having to resort to fundamental adjustments like going to a diffusion LLM
Really?
I think this site often overlooks that second group and how large it likely is.
I don't have any desire (or think it's a good use of LLMs) to one-shot features because even SotA models are incredibly bad at this. I'm optimizing for what they actually seem to be able to do reliably and pretty well, and I want those things to be done fast so I can get on with things.
The Xiaomi team really brought something to the table.
> "However, naively applying FP4 across the entire model causes degradation in complex reasoning, logic, and code generation. Given the MoE (Mixture of Experts) architecture of Xiaomi MiMo-V2.5-Pro — where Experts constitute the vast majority of parameters and exhibit the highest tolerance to quantization — we selectively quantize only the MoE Experts to FP4 while preserving original precision for all other modules. Through FP4 QAT (Quantization-Aware Training), we dramatically reduce model size and maximize hardware bandwidth utilization while keeping the model's overall capability essentially on par with the original, as shown below"
It will be cool to measure models based on their RAW performance and measure them in terms of ROI - not some benchmark but something meaningful like we used this model to solve X.
That will be a massive mind shift and might justify the token expenditure.
We used the AI to solve given problem with x% adherence/quality/correctness?
Despite the performative UI components they have a shipped (demo) product:
This is only 3.1 8B and a very small context window, but at 17k tokens per second it's likely enough to reliably call tools which would make a huge difference in agentic applications. Assuming they can bake in better models I'm just as bullish or even moreso on this, considering this opens up edge computing at the extremely low power requirement.
High tok/s is the future IMO.
The only players that seem to be capable of a consistent pattern of doing more with less currency are the Chinese labs.
- dflash: new-ish but February is ancient by the standards of the pace of AI innovation lately, I guess applying it to a 1T model is new-ish in the sense that the dflash researchers don't have the hw budget to prove that out
- persistent engine kernel: this is like CUDA 101
- warp specialization: I think this just means "keep different gpu resources all busy w/ pipelining" which is CUDA 201, some of it is even baked into pytorch now
- MXFP4 QAT: not new
- TileRT: hard to tell what this actually does, there's a PyPi wheel with support for DS 3.2 and GLM 5 but binary only
> On the model side, we applied FP4 quantization
> introduced DFlash, an efficient speculative decoding method based on block-level masked parallel prediction
> On the system side, TileRT perfectly adapts to the dynamic characteristics of these algorithms
> 1000+ tokens/s output [...] using just a single standard 8-GPU commodity node
$2.61/M tokens * 1,000 tok/s = $9.40/hr
That would be pretty cheap for an 8-GPU node which would typically run around $45/hr or more. Guess this depends on how many parallel streams it can handle.
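The arithmetic, as a quick sketch. The price, speed and node cost are the figures from the comment above; the function names are made up, and the break-even number assumes every stream sustains the full 1,000 tok/s at 100% utilization, which nothing in the article establishes.

```python
import math

def revenue_per_hour(price_per_million_tokens, tokens_per_second, streams=1):
    """Dollars billed per hour if every stream emits tokens continuously."""
    tokens_per_hour = tokens_per_second * 3600 * streams
    return price_per_million_tokens * tokens_per_hour / 1_000_000

def break_even_streams(price_per_million_tokens, tokens_per_second, node_cost_per_hour):
    """Smallest number of fully busy parallel streams that covers the node's hourly cost."""
    per_stream = revenue_per_hour(price_per_million_tokens, tokens_per_second)
    return math.ceil(node_cost_per_hour / per_stream)

if __name__ == "__main__":
    print(round(revenue_per_hour(2.61, 1000), 2))   # one stream at $2.61/M and 1,000 tok/s
    print(break_even_streams(2.61, 1000, 45.0))     # streams needed against a $45/hr node
```

So a single stream brings in roughly $9.40/hr, and it would take about five concurrently saturated streams to cover a $45/hr node under those assumptions.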
At present, intelligent driving still feels, in general, like a beginner driver who drives mainly by reaction. FSD is a little better. But it still lacks the kind of “spirit” human drivers have. How to say it: when a human driver sees the car in front shaking left and right, he can guess that the driver may not be fully conscious, and then keep away from it. Current assisted driving systems are still quite weak in this kind of understanding of the world.
The most important thing in driving is prediction. But driving itself does not need very deep or very complicated reasoning. Recently I tried using Mimo for development, and I believe the understanding ability it can provide is absolutely more than enough for driving scenarios. Sadly, the Pro version does not have multimodal ability. And this US version seems to be trying to solve the biggest problem of using LLMs in control systems: latency.
Xiaomi’s car is good, but its assisted driving level is near the bottom in the same class. Compared with new EV makers, its route is quite “traditional”, just like comparing lap times with Porsche at the Nürburgring. Xiaomi’s large model team may change this.
Hopefully this pans out and fast models (that are also not ridiculously dumb) become the norm. It's amazing what you can unlock with even a single order of magnitude's speed improvement.
This thing is seriously fast and was good enough to switch it in for the other model I was using. I tried it for both planning, executing, and subagent tasks and it performed adequately in all 3.
So, this is another one to add to the list next to DeepSeek-V4-Pro and Qwen-3.7-Max...
This could bring proper desktop AI to the average laptop user, which could be a game changer for running local models.
Getting ~1000 TPS on near-frontier intelligence is a step change, and enables whole new use-cases for applications. Seeing limited compute resources beget selective access makes me worry for the future of competition.
Are you kidding me. Come back when you are ready for the users. I was hoping to try it, what a frustration.
update: AFTER signing up, and only then, am I told: 'This service is not available in your region yet.'
Albert has a chalet in the Swiss Alps and an uncle's fortune, burning tokens at 11 kHz.
Joe has a rental capsule and a UBI, burning equally priced tokens at 23kHz.
Who's the first to solve the problem of maniacs in power?