undefined | Better HN

0 pointsvessenes1mo ago0 comments

This math is useful. Lots of folks scoffing in the comments below. I have a couple reactions, after chatting with it:

1) 16k tokens / second is really stunningly fast. There’s an old saying about any factor of 10 being a new science / new product category, etc. This is a new product category in my mind, or it could be. It would be incredibly useful for voice agent applications, realtime loops, realtime video generation, .. etc.

2) https://nvidia.github.io/TensorRT-LLM/blogs/H200launch.html Has H200 doing 12k tokens/second on llama 2 12b fb8. Knowing these architectures that’s likely a 100+ ish batched run, meaning time to first token is almost certainly slower than taalas. Probably much slower, since Taalas is like milliseconds.

3) Jensen has these pareto curve graphs — for a certain amount of energy and a certain chip architecture, choose your point on the curve to trade off throughput vs latency. My quick math is that these probably do not shift the curve. The 6nm process vs 4nm process is likely 30-40% bigger, draws that much more power, etc; if we look at the numbers they give and extrapolate to an fp8 model (slower), smaller geometry (30% faster and lower power) and compare 16k tokens/second for taalas to 12k tokens/s for an h200, these chips are in the same ballpark curve.

However, I don’t think the H200 can reach into this part of the curve, and that does make these somewhat interesting. In fact even if you had a full datacenter of H200s already running your model, you’d probably buy a bunch of these to do speculative decoding - it’s an amazing use case for them; speculative decoding relies on smaller distillations or quants to get the first N tokens sorted, only when the big model and small model diverge do you infer on the big model.

Upshot - I think these will sell, even on 6nm process, and the first thing I’d sell them to do is speculative decoding for bread and butter frontier models. The thing that I’m really very skeptical of is the 2 month turnaround. To get leading edge geometry turned around on arbitrary 2 month schedules is .. ambitious. Hopeful. We could use other words as well.

I hope these guys make it! I bet the v3 of these chips will be serving some bread and butter API requests, which will be awesome.

0 comments

rbanffy1mo ago

> any factor of 10 being a new science / new product category,

I often remind people two orders of quantitative change is a qualitative change.

> The thing that I’m really very skeptical of is the 2 month turnaround. To get leading edge geometry turned around on arbitrary 2 month schedules is .. ambitious. Hopeful. We could use other words as well.

The real product they have is automation. They figured out a way to compile a large model into a circuit. That's, in itself, pretty impressive. If they can do this, they can also compile models to an HDL and deploy them to large FPGA simulators for quick validation. If we see models maturing at a "good enough" state, even a longer turnaround between model release and silicon makes sense.

While I also see lots of these systems running standalone, I think they'll really shine combined with more flexible inference engines, running the unchanging parts of the model while the coupled inference engine deals with whatever is too new to have been baked into silicon.

I'm concerned with the environmental impact. Chip manufacture is not very clean and these chips will need to be swapped out and replaced at a cadence higher than we currently do with GPUs.

ttul1mo ago

Having dabbled in VLSI in the early-2010s, half the battle is getting a manufacturing slot with TSMC. It’s a dark art with secret handshakes. This demonstrator chip is an enormous accomplishment.

vessenesOP1mo ago

Yeah and a team I’m not familiar with — I didn’t check bios but they don’t lead with ‘our team made this or that gpu for this or that bigco’.

The design ip at 6nm is still tough; I feel like this team must have at least one real genius and some incredibly good support at tsmc. Or they’ve been waiting a year for a slot :)

1 more reply

VagabundoP1mo ago

There might be a foodchain of lower order uses when they become "obsolete".

rbanffy1mo ago

I think there will be a lot of space for sensorial models in robotics, as the laws of physics don't change much, and a light switch or automobile controls have remained stable and consistent over the last decades.

Gareth3211mo ago

I think the next major innovation is going to be intelligent model routing. I've been exploring OpenClaw and OpenRouter, and there is a real lack of options to select the best model for the job and execute. The providers are trying to do that with their own models, but none of them offer everything to everyone at all times. I see a future with increasingly niche models being offered for all kinds of novel use cases. We need a way to fluidly apply the right model for the job.

nylonstrung1mo ago

Agree that routing is becoming the critical layer here. Vllm iris is really promising for this https://blog.vllm.ai/2026/01/05/vllm-sr-iris.html

There's already some good work on router benchmarking which is pretty interesting

condiment1mo ago

At 16k tokens/s why bother routing? We're talking about multiple orders of magnitude faster and cheaper execution.

Abundance supports different strategies. One approach: Set a deadline for a response, send the turn to every AI that could possibly answer, and when the deadline arrives, cancel any request that hasn't yet completed. You know a priori which models have the highest quality in aggregate. Pick that one.

IanCal1mo ago

The best coding model won’t be the best roleplay one which won’t be the best at tool use. It depends what you want to do in order to pick the best model.

1 more reply

monooso1mo ago

I came across this yesterday. Haven't tried it, but it looks interesting:

https://agent-relay.com/

ssivark1mo ago

> speculative decoding for bread and butter frontier models. The thing that I’m really very skeptical of is the 2 month turnaround. To get leading edge geometry turned around on arbitrary 2 month schedules is .. ambitious

Can we use older (previous generation, smaller) models as a speculative decoder for the current model? I don't know whether the randomness in training (weight init, data ordering, etc) will affect this kind of use. To the extent that these models are learning the "true underlying token distribution" this should be possible, in principle. If that's the case, speculative decoding is an elegant vector to introduce this kind of tech, and the turnaround time is even less of a problem.

btown1mo ago

For speculative decoding, wouldn’t this be of limited use for frontier models that don’t have the same tokenizer as Llama 3.1? Or would it be so good that retokenization/bridging would be worth it?

Zetaphor1mo ago

My understanding as well is that speculative decoding only works with a smaller quant of the same model. You're using the faster sampling of the smaller models representation of the larger models weights in order to attempt to accurately predict its token output. This wouldn't work cross-model as the token probabilities are completely different.

jasonjmcghee1mo ago

This is not correct.

Families of model sizes work great for speculative decoding. Use the 1B with the 32B or whatever.

It's a balance as you want it to be guessing correctly as much as possible but also be as fast as possible. Validation takes time and every guess needs to be validated etc

The model you're using to speculate could be anything, but if it's not guessing what the main model would predict, it's useless.

1 more reply

ashirviskas1mo ago

Smaller quant or smaller model?

Afaik it can work with anything, but sharing vocab solves a lot of headaches and the better token probs match, the more efficient it gets.

Which is why it is usually done with same family models and most often NOT just different quantizations of the same model.

1 more reply

vessenesOP1mo ago

I think they’d commission a quant directly. Benefits go down a lot when you leave model families.

jasonwatkinspdx1mo ago

They may be using Rapidus, which is a Japanese government backed foundry built around all single wafer processing vs traditional batching. They advertise ~2 month turnaround time as standard, and as short as 2 weeks for priority.

empath751mo ago

Think about this for solving questions in math where you need to explore a search space. You can run 100 of these for the same cost and time of doing one api call to open ai.

joha42701mo ago

The guts of a LLM isn't something I'm well versed in, but

> to get the first N tokens sorted, only when the big model and small model diverge do you infer on the big model

suggests there is something I'm unaware of. If you compare the small and big model, don't you have to wait for the big model anyway and then what's the point? I assume I'm missing some detail here, but what?

connorbrinton1mo ago

Speculative decoding takes advantage of the fact that it's faster to validate that a big model would have produced a particular sequence of tokens than to generate that sequence of tokens from scratch, because validation can take more advantage of parallel processing. So the process is generate with small model -> validate with big model -> then generate with big model only if validation fails

More info:

* https://research.google/blog/looking-back-at-speculative-dec...

* https://pytorch.org/blog/hitchhikers-guide-speculative-decod...

sails1mo ago

See also speculative cascades which is a nice read and furthered my understanding of how it all works

https://research.google/blog/speculative-cascades-a-hybrid-a...

speedping1mo ago

Verification is faster than generation, one forward pass for verification of multiple tokens vs a pass for every new token in generation

vanviegen1mo ago

I don't understand how it would work either, but it may be something similar to this: https://developers.openai.com/api/docs/guides/predicted-outp...

cma1mo ago

When you predict with the small model, the big model can verify as more of a batch and be more similar in speed to processing input tokens, if the predictions are good and it doesn't have to be redone.

ml_basics1mo ago

They are referring to a thing called "speculative decoding" I think.

j / k navigate · click thread line to collapse

0 pointsvessenes1mo ago0 comments

This math is useful. Lots of folks scoffing in the comments below. I have a couple reactions, after chatting with it:

I hope these guys make it! I bet the v3 of these chips will be serving some bread and butter API requests, which will be awesome.

0 comments

rbanffy1mo ago

> any factor of 10 being a new science / new product category,

I often remind people two orders of quantitative change is a qualitative change.

I'm concerned with the environmental impact. Chip manufacture is not very clean and these chips will need to be swapped out and replaced at a cadence higher than we currently do with GPUs.

ttul1mo ago

Having dabbled in VLSI in the early-2010s, half the battle is getting a manufacturing slot with TSMC. It’s a dark art with secret handshakes. This demonstrator chip is an enormous accomplishment.

vessenesOP1mo ago

Yeah and a team I’m not familiar with — I didn’t check bios but they don’t lead with ‘our team made this or that gpu for this or that bigco’.

The design ip at 6nm is still tough; I feel like this team must have at least one real genius and some incredibly good support at tsmc. Or they’ve been waiting a year for a slot :)

1 more reply

VagabundoP1mo ago

There might be a foodchain of lower order uses when they become "obsolete".

rbanffy1mo ago

Gareth3211mo ago

nylonstrung1mo ago

Agree that routing is becoming the critical layer here. Vllm iris is really promising for this https://blog.vllm.ai/2026/01/05/vllm-sr-iris.html

There's already some good work on router benchmarking which is pretty interesting

condiment1mo ago

At 16k tokens/s why bother routing? We're talking about multiple orders of magnitude faster and cheaper execution.

IanCal1mo ago

The best coding model won’t be the best roleplay one which won’t be the best at tool use. It depends what you want to do in order to pick the best model.

1 more reply

monooso1mo ago

I came across this yesterday. Haven't tried it, but it looks interesting:

https://agent-relay.com/

ssivark1mo ago

btown1mo ago

Zetaphor1mo ago

jasonjmcghee1mo ago

This is not correct.

Families of model sizes work great for speculative decoding. Use the 1B with the 32B or whatever.

It's a balance as you want it to be guessing correctly as much as possible but also be as fast as possible. Validation takes time and every guess needs to be validated etc

The model you're using to speculate could be anything, but if it's not guessing what the main model would predict, it's useless.

1 more reply

ashirviskas1mo ago

Smaller quant or smaller model?

Afaik it can work with anything, but sharing vocab solves a lot of headaches and the better token probs match, the more efficient it gets.

Which is why it is usually done with same family models and most often NOT just different quantizations of the same model.

1 more reply

vessenesOP1mo ago

I think they’d commission a quant directly. Benefits go down a lot when you leave model families.

jasonwatkinspdx1mo ago

empath751mo ago

Think about this for solving questions in math where you need to explore a search space. You can run 100 of these for the same cost and time of doing one api call to open ai.

joha42701mo ago

The guts of a LLM isn't something I'm well versed in, but

> to get the first N tokens sorted, only when the big model and small model diverge do you infer on the big model

connorbrinton1mo ago

More info:

* https://research.google/blog/looking-back-at-speculative-dec...

* https://pytorch.org/blog/hitchhikers-guide-speculative-decod...

sails1mo ago

See also speculative cascades which is a nice read and furthered my understanding of how it all works

https://research.google/blog/speculative-cascades-a-hybrid-a...

speedping1mo ago

Verification is faster than generation, one forward pass for verification of multiple tokens vs a pass for every new token in generation

vanviegen1mo ago

I don't understand how it would work either, but it may be something similar to this: https://developers.openai.com/api/docs/guides/predicted-outp...

cma1mo ago

ml_basics1mo ago

They are referring to a thing called "speculative decoding" I think.

j / k navigate · click thread line to collapse