Alright, I don't understand everything yet, but you said ~5 seconds per token, so for prompts with hundreds to a thousand tokens we are on the order of tens of minutes to over an hour (1,000 tokens × 5 s ≈ 5,000 s, about 83 minutes). I would be targeting coding prompts.
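Just to sanity-check that arithmetic, here is a quick back-of-envelope script; the 5 s/token figure is the one from this conversation, and the token counts are hypothetical:

```python
# Back-of-envelope latency: seconds per token times token count.
SECONDS_PER_TOKEN = 5.0  # figure quoted above; actual speed varies by hardware/model

for n_tokens in (100, 500, 1000):
    total_s = n_tokens * SECONDS_PER_TOKEN
    print(f"{n_tokens:5d} tokens -> {total_s / 60:6.1f} minutes")

# Output:
#   100 tokens ->    8.3 minutes
#   500 tokens ->   41.7 minutes
#  1000 tokens ->   83.3 minutes
```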
Well, it means that one day I'll have to get into the real thing: the actual inference code, actually running inference on a small model.
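For what that first step might look like, here is a minimal sketch, assuming the Hugging Face transformers library and a small model like "gpt2" (~124M parameters); neither choice is from the conversation, they're just common starting points:

```python
# Minimal local inference on a small causal language model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any small causal LM would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "def fibonacci(n):"  # a coding-style prompt, per the goal above
inputs = tokenizer(prompt, return_tensors="pt")

# generate() produces tokens one at a time under the hood; this loop
# is exactly where the per-token latency discussed above comes from.
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Timing that `generate()` call would give a real seconds-per-token number for the machine at hand, instead of the ~5 s estimate.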