DSpark: Speculative decoding accelerates LLM inference [pdf]

(github.com)

560 points · aurenvale · 6h ago · 218 comments

61 comments · 18 top-level

kamranjon · 4h ago · 25 in thread

DeepSeek continues to not only push the boundaries but also publish these incredible papers explaining how they achieved their gains - something the American labs no longer do unfortunately. Chinese labs are doing the most interesting work in AI right now.

tomalaci · 4h ago

Probably because American AI companies are on the hook for quite a lot of investment money. I think they are trying to find the magical moat to justify their valuation.

Revealing optimizations similar to these would pretty much reduce their competitive position.

lwansbrough · 4h ago

Chinese labs are also still behind, so they’re incentivized to collaborate and have no reason to do it in private.

I suspect their tune will change if they ever take the lead..

oefrha · 4h ago

Which is a good thing. Self-serving motives are more reliable than altruistic ones.

tw1984 · 4h ago

> Chinese labs are also still behind, so they’re incentivized to collaborate and have no reason to do it in private.

US labs at Google, Meta and SpaceX are not leading; none of them has managed to build something on par with GLM 5.2.

Care to explain to me why they still don't collaborate and still choose to do it in private?

jmyeet · 4h ago

Projection is a funny thing. It causes people to misread situations all the time. Southern slaveowners feared violent retribution from freed slaves, for example [1]. It was pure projection and said more about the South than it did the slaves. The reality was there was no violent retribution. It was the opposite where the former slaveowners continued to inflict violence on the formerly enslaved.

I say this because we see the same thing used as an argument against China. "If they overtake us, they'll do imperialism (like us)." Again, it says more about us than them.

A better reading (IMHO) of the situation is that China believes that AI shouldn't be used simply to mint a few more trillionaires but the benefits should be shared with society. Why do I say this? Because we now have 70+ years of China doing exactly that. The transformation in China all the way from rural villages to Tier 1 cities has been utterly astounding. China has lifted ~800M people out of extreme poverty.

In some ways we're at a similar point to the late 1990s and 2000s when Microsoft execs complained that Linux, being free, destroyed intellectual property value. Linux should be a perfect example of how people can and do act altruistically, or at least not in a way to bait-and-switch to enrich themselves.

[1]: https://www.reddit.com/r/AskHistory/comments/1d26grm/in_the_...

colordrops · 4h ago

So the marketplace is working.

cromka · 4h ago

I seriously am far from fear mongering and doomsday mentality, but I just can't see how OpenAI and Anthropic can have a successful IPO if the quality gap between the free and paid continues to narrow like that...

cyanydeez · 4h ago

fascism. it works be corporate fascism.

budsniffer952 · 4h ago

Do you think that DeepSeek are building their models for free, or something? They aren't "on the hook" for anything?

What's with all the China glazing about this stuff? They release some open-source work and people act like they are suddenly the beacon of freedom and transparency.

abc123abc123 · 4h ago

This is incorrect binary thinking. Them releasing open source can be good, but that does not commit you to thinking that China or Chinese companies are saints. There are many shades of grey here and one does not exclude the other (nor include it).

7speter · 4h ago

I think it’s in our best interests to push these American AI companies to exhibit at least some degree of freedom and transparency any way we can…

herodoturtle · 4h ago

Publishing by necessity, I wonder? American labs are on the cutting edge, pioneering the way forward, so DeepSeek open-sourcing what they’ve got helps even the playing field.

Hopefully the experts here can offer insight. The above is just my hunch and I’m not a specialist in this field.

jonplackett · 4h ago

Wouldn’t that just help the American labs anyway though? Or do they assume they’ve actually already figured this stuff out and kept it secret?

7speter · 3h ago

From what I gather, the Chinese are behind, but a lot of their research amounts to scrappy, clever discoveries in how to use more novel technologies (for Qwen and DeepSeek, it’s mixture-of-experts models, which can do inference using a portion of the model at a time). The Chinese also distill information from American models, so there’s that.

The American companies, from my impression, don’t involve themselves with such lowly “hacks” because they have so much money to just push forward with doing everything on big heavy models that run on the most cutting-edge Nvidia chips that they can, at the moment, kinda sorta get on demand (I say that in some degree of jest).

_0ffh · 4h ago

I'm afraid I'm even balking at the word "pioneering" in context with US frontier labs. They are probably doing a few new things, right, but they are not blazing any trails for others to follow along, the Chinese are.

epolanski · 4h ago

Chinese papers and techniques have been very influential and copied by US labs.

Multi-head Latent Attention (MLA), Multi-Token prediction, MoE architecture are some of the most famous examples.

rvz · 4h ago

Exactly. They did not have to open up their research, and this is what happens when smart researchers are forced to squeeze performance gains out of existing hardware.

They don't have TPUs or access to the latest Vera Rubin GPUs either to get performance gains for free. All of the optimizations Deepseek have done are in software and it goes down to the PTX assembly level.

Compared to Anthropic who are celebrating in fixing a flickering issue in a terminal app which took months to fix.

vidarh · 4h ago

> Compared to Anthropic who are celebrating in fixing a flickering issue in a terminal app which took months to fix.

It's funny, because if you ran Claude Code on a slow terminal, the cause of the flicker was obvious: They kept dumping the entire history of the chat back into the terminal in a number of situations, and relied on the terminal to then end up in the correct state.

yorwba · 4h ago

Anthropic almost certainly also has optimized software down to the assembly level, considering this take-home interview challenge they published: https://github.com/anthropics/original_performance_takehome/... which is all about instruction-level performance optimizations. That they don't prioritize UI fixes just means they consider other things more important.

lelanthran · 3h ago

Unlikely: that product is written completely by AI, of which they are not lacking.

More likely is that an AI-generated codebase is impossible for humans to fix, and SOTA was not able to figure it out until now.

lionkor · 3h ago

that's pretty silly to use as a measure of what they do internally

epolanski · 4h ago

R1 was very influential on US model development.

jmyeet · 4h ago

Chinese companies (and labs) operate in conjunction with the CCP so whatever they're doing, it's because it's Chinese state policy.

What became clear when DeepSeek came onto the scene was that China was seeking to commoditize LLMs. They consider it an issue of national security not to be beholden to US tech companies when it comes to AI. And I, for one, fully endorse this policy.

Another data point on this is the black market for Claude tokens in China [1]. The chat logs themselves are a commodity to train models.

I believe that OpenAI in particular is a bet on a trillion dollar pot of gold that doesn't exist. Google, Microsoft, Amazon and Meta will all be fine. Anthropic is in a far better position than OpenAI (IMHO) but if DeepSeek or some other Chinese open weight model gets as good at coding, they're in real trouble too.

[1]: https://news.ycombinator.com/item?id=48667495

tw1984 · 3h ago

> Another data point on this is the black market for Claude tokens in China [1]. The chat logs themselves are a commodity to train models.

anyone with IQ higher than 130 (thus qualified for actual AI R&D) would be questioning something obvious here -

if they are already doing such dodgy stuff with the aim to maximize profits, why would those resellers have large amount of logs with actual American model responses to sell to those AI labs in the first place. shouldn't they just post train & customize some leading Chinese open source models to pretend to be Opus or GPT for the vast majority of their users (as classified by some models) who don't know much about expected Opus behaviours & not skilled enough to tell the differences?

that is actually the interesting bit not covered in your censored version of the story line, it is also what happens on the ground. your censored version of the story implies that those dodgy resellers using stolen credit cards, pooling accounts with stolen IDs and illegally selling very personal logs would somehow be honest enough to spend extra $ to ensure their victims (aka paying users) can actually use real Opus and GPT. LOL

dude, you failed this IQ test miserably.

jampekka · 3h ago

The galaxy brains in the labs putatively buying the logs wouldn't notice this? Or figure out a structure to prevent this?

pokot0 · 4h ago · 6 in thread

I am wondering if this is why they can offer their pro model at ~1/4th of the price compared to the other providers offering the same model, and if other providers will be able to do the same in a short timeframe.

sschueller · 4h ago

I have been heavily using DeepSeek V4 Pro at Max for a month now and I would say it is 100x cheaper. If I pay for Claude I will hit that limit so fast I am always waiting 5 hours. Using the frontier models at Kilo I go through dollars while doing the same thing via DeepSeek it is pennies.

ddxv · 3h ago

I believe the comment you replied to was talking about the cost on providers like OpenCode vs Deepseek API. Deepseek API is even cheaper than the other providers for the same deepseek models.

vidarh · 4h ago

It'd presumably help a lot, but also when you use their endpoint they get more training data.

nicce · 4h ago

This applies to every provider. OpenAI seems to be the worst hoarder.

pokot0 · 4h ago

actually you can buy inference on third party providers that serve deepseek v4 pro with zero data retention (ZDR).

epolanski · 4h ago

US labs do it too.

piterrro · 5h ago · 5 in thread

I’ve been using DeepSeek V4 Pro for a month now in Kilo Code and it’s great. Fast, reliable, large context window and cheap as… Did 1.5B tokens this month and it cost me 40 USD (majority cached, but still).
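
As a rough check on the figures quoted above (1.5B tokens for 40 USD), the effective blended rate can be worked out directly. This is a back-of-the-envelope sketch only: the helper name `usd_per_million_tokens` is made up for illustration, and the result blends cached and uncached tokens, so it is not a list price.

```python
def usd_per_million_tokens(total_usd, total_tokens):
    """Effective blended price: total spend divided by millions of tokens used."""
    return total_usd / (total_tokens / 1_000_000)


# Figures from the comment above: 40 USD for 1.5 billion tokens in a month.
rate = usd_per_million_tokens(40, 1_500_000_000)
print(f"{rate:.4f} USD per million tokens")  # prints "0.0267 USD per million tokens"
```

Because most of those tokens were cached, the rate for uncached input and output tokens would presumably be higher than this blended figure.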

richardlblair · 2h ago

I've been using omp with deepseek as my task and quicktask agents, and sonnet as everything else.

It's drastically reduced my AI spend. I went from spending $40/day to $10/day.

spiderfarmer · 5h ago

Is there a way to see how many tokens one uses with Claude Code (Pro)?

bpavuk · 4h ago

the casino has no clocks, as one HN user put it some time ago.

I second ccusage, it's nice

cptchaos · 4h ago

https://ccusage.com/

edg5000 · 4h ago

It's in the JSONs in ~/.claude, but last 30 days only I think. You can have the model analyze history. So for correct history you'd need to run history analysis on a cron job or something. Kinda hacky.

2838383838 · 5h ago · 3 in thread

Must be wonderful to be on the board of OpenAi et al & their PE investors whilst China keeps blowing up these mines under their feet lmao. Luckily Korean pension funds will buy all the trash as usual but goddamn you gotta start moving quick or you are gonna need some serious AGI to show you how to offload those bonds

ForHackernews · 4h ago

"We will build the machine-god and pray for it to pay for itself."

FridgeSeal · 4h ago

Every day, the rate of “could post a picture of 40k tech priests and have it taken unironically” goes up, and it’s starting to get concerning.

ozgrakkurt · 4h ago

Don’t worry they will sell all the hardware and data they acquired with their grift

ricardobeat · 5h ago · 2 in thread

Presumably this has been in production for a while, and is one of the reasons they were able to dramatically lower prices a month ago?

chronogram · 3h ago

Yes. Section 5 talks about real-world deployment: 5.1: "The DSpark draft models are co-deployed with the preview versions of DeepSeek-V4-Flash and DeepSeek-V4-Pro"; 5.4: "MTP-1 represents the former production setup, having been superseded by DSpark two weeks following the DeepSeek-V4-preview release."

_0ffh · 4h ago

Lookahead Sparse Attention should be playing a big role as well, as it dramatically slashes memory consumption.

Jackobrien · 5h ago · 2 in thread

I see a world soon where there’s an extremely wide variety of small models for speculative decoding, unique to use cases, companies, and even individuals.

nicce · 5h ago

Hopefully that is the case and hardware does not get impossible to get.

pydry · 4h ago

yes, heavily constrained by sophisticated guardrails.

this is definitely where things are going. the enormous "eat the world" models have extreme diminishing returns by comparison.

StizzurpXDD · 2h ago

DeepSeek is, as I currently feel, the sole AI company that is actually trying to innovate rather than merely top benchmarks. Others like OpenAI, Anthropic and Google are mostly just competing with each other rather than continuing to innovate around the clock.

kamranjon · 4h ago

The Hugging Face models are already up and seem to be the original models with the speculative decoding module built in, which is very cool:

Flash: https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash-DSpark

Pro: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro-DSpark

Excited to see if this makes it into DwarfStar for local inference, have been using the flash model extensively since the 2-bit quants were made available by antirez.

Havoc · 5h ago

Nice.

Guessing the timing isn't accidental. Demonstrated openness vs harsh regulation

xnx · 46m ago

Is this newer/better than the speculative decoding from 2022? https://arxiv.org/abs/2211.17192

articlepan · 1h ago

Title is bad, it's the first line of the abstract instead of the paper title. Speculative decoding for LLM inference was published in 2022: https://arxiv.org/abs/2211.17192

This paper seems to be an improvement to speculative decoding but I haven't read it yet.
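
For readers who have not seen the 2022 baseline linked above, its core accept/reject rule can be sketched in a few lines. This is a toy illustration of vanilla speculative sampling for a single token, not DSpark's method; the function name `speculative_step` and the explicit probability lists `p` (target model) and `q` (draft model) are assumptions made for the example, since real systems work on model logits and verify several drafted tokens per pass.

```python
import random


def speculative_step(p, q, rng):
    """One accept/reject step of vanilla speculative sampling.

    p: target-model distribution over the vocabulary (probabilities summing to 1)
    q: draft-model distribution over the same vocabulary
    Returns (token, accepted), where accepted says whether the draft token was kept.
    """
    # The cheap draft model proposes a token x ~ q.
    x = rng.choices(range(len(q)), weights=q)[0]
    # The target model keeps it with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # On rejection, resample from the residual max(0, p - q);
    # rng.choices normalizes the weights itself.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0], False
```

Repeating this step many times and tallying the returned tokens should recover `p` itself, which is why the scheme speeds up decoding without changing the target model's output distribution. The speedup depends on how often the draft token is accepted.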

wg0 · 42m ago

That's why I pay them. Regularly. Without fail. Even though my token usage isn't that much.

But I vote for these heroes with my wallet. Just yesterday did again.

danielabinav160 · 4h ago

Would love to see these numbers reproduced on consumer GPUs, not just A100s.

lelanthran · 3h ago

These companies providing tokens, whether SOTA or not, that want to IPO are so fucked as time goes on.

Can't sell their SOTA models, only slightly better than the open source models for the models they can sell, cost 20x to 50x for good models, a TAM that consists almost solely of developers, with no customer of theirs actually boasting increased profits as a result of AI...

I fear their time to IPO may have passed.

rvz · 5h ago

This is just one of many papers DeepSeek have released to be able to serve models at extremely cheap prices, unlike the others taking on >$100B of debt to build data centers for the same thing.

> As with V4-Flash, we treat this point as an indication that DSpark sustains useful throughput under an interactivity target that the baseline cannot efficiently support. At matched system capacities, DSpark delivers 57% to 78% faster per-user generation.

Reminds me of the flawed solution in scaling servers in 2017 that use memory-intensive technologies by adding even more servers to solve the problem. (It just increases costs.)

Rather than doing that, think about which critical parts of your app can be written in a more performant technology.

Fast forward to 2026, now you can see who is just throwing more money at the problem to create even more problems, whereas DeepSeek is giving us optimized solutions.

I know exactly who I would pay attention to, and it is absolutely not Anthropic.

bflesch · 3h ago

At this point, why can't someone produce a fridge- or container-sized AI appliance based on legacy chips (12nm)? I imagine this would cover 80% of corporate use cases where you need "Google-in-a-box" functionality.

State-of-the-art process nodes are out of reach, but if you have infinite solar energy during business hours, does it really matter? Every company has a parking spot, so this ASIC-like appliance could be as big as a shipping container.

If it could just run recent open models for a handful of users it would be such a no-brainer to buy.

preetham_rangu · 5h ago

do they use their OCR, or someone else?

playorizaya · 1h ago

Meanwhile OpenAI is drafting an “open letter” to Congress /s

OpenAI and Anthropic are doing nothing interesting.

Basically forgot about them 2 years ago.

I don’t use DeepSeek either but at least they do interesting stuff - they were the first to do “thinking” iirc
