pama 1y ago

A big part of why R1 is much slower than o3-mini is that most solutions for serving R1 models have not yet been optimized for inference (so in terms of latency R1 is comparable to o1 or o1 pro rather than to o1-mini or o3-mini). The MoE is already relatively efficient if perfectly load balanced in an inference setting, and its latency and throughput should match or beat those of an equivalent dense model with 37B parameters. In practice, thanks to MLA, inference should be faster still for long contexts compared to typical dense models. If DeepSeek or someone else distilled the model onto another MoE architecture with even fewer active parameters and properly implemented speculative decoding on top, one could gain additional inference speedups. I imagine we will see these things, but it will take a bit of time until they are all public.
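
To make the two levers above concrete, here is a rough back-of-envelope sketch in Python. It is not a description of how DeepSeek or anyone else serves R1. It assumes single-sequence decoding is bound by reading the active weights from memory once per token, and that speculative decoding accepts each drafted token independently with a fixed probability. The function names and every number passed in (bandwidth, bytes per parameter, acceptance rate, draft length, the smaller active-parameter count) are hypothetical placeholders; only the 37B figure comes from the comment.

```python
def decode_ms_per_token(active_params_billion, bytes_per_param, mem_bandwidth_gb_s):
    """Rough lower bound on per-token decode latency in milliseconds.

    Assumption: decoding is memory-bandwidth bound and only the *active*
    parameters are read once per generated token (batch size 1).
    """
    bytes_read = active_params_billion * 1e9 * bytes_per_param
    seconds = bytes_read / (mem_bandwidth_gb_s * 1e9)
    return seconds * 1e3


def expected_tokens_per_step(acceptance_rate, draft_len):
    """Expected tokens emitted per target-model pass with speculative decoding.

    Assumption: each drafted token is accepted independently with the same
    probability; the target model always contributes one token itself.
    """
    a = acceptance_rate
    if a >= 1.0:
        return float(draft_len + 1)
    # Geometric series 1 + a + a^2 + ... + a^draft_len
    return (1 - a ** (draft_len + 1)) / (1 - a)


if __name__ == "__main__":
    # Hypothetical hardware: 1000 GB/s of aggregate memory bandwidth, 1 byte per parameter.
    base = decode_ms_per_token(37, 1, 1000)   # 37B active parameters, as in the comment
    small = decode_ms_per_token(10, 1, 1000)  # hypothetical distilled MoE with fewer active parameters
    gain = expected_tokens_per_step(0.8, 4)   # hypothetical acceptance rate and draft length
    print(f"37B active: {base:.1f} ms/token, 10B active: {small:.1f} ms/token")
    print(f"speculative decoding: about {gain:.2f} tokens per target pass")
```

Under these assumptions the two effects multiply: fewer active parameters shrink the cost of each target pass, and speculative decoding spreads that cost over several tokens. The cost of running the draft model and any batching effects are ignored here.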

rfoo 1y ago

I know that, I'm in this game. I was comparing API throughput/ttft/ttbt of DeepSeek's own R1 API, before it went viral in the West, with o3-mini.

I remain unconvinced that DeepSeek themselves failed to optimize their own V3 inference well enough and left another 2-3x improvement on the table.

pama (OP) 1y ago

I am sure DeepSeek did optimize the inference cost of R1. What they have not yet released is an efficient downscaled MoE version of it, i.e., an R1-mini.
