It will be interesting to see the value/performance of next-gen M4 Ultras (or Extremes?) versus NVIDIA's new DIGITS [2] when they're released.
As for Apple, we'll see.
https://nvidianews.nvidia.com/news/nvidia-puts-grace-blackwe...
6 to 8 tokens per second.
And less than a tenth of the cost of a GPU setup.
That's almost nothing. If these models are capable/functional enough for most day-to-day uses, then useful LLM-based GenAI is already at the "too cheap to meter" stage.
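Rough back-of-the-envelope, since "too cheap to meter" is doing a lot of work there. The hardware price, power draw, and electricity rate below are my own assumptions, not numbers from upthread; only the 6-8 tok/s is:

    # Back-of-envelope cost per token for local inference.
    # All dollar figures and wattage are assumptions for illustration;
    # only the 6-8 tok/s rate comes from the comment above.
    HARDWARE_COST_USD = 5000        # assumed machine cost
    LIFETIME_YEARS = 3              # assumed amortization period
    POWER_DRAW_W = 200              # assumed average draw while generating
    ELECTRICITY_USD_PER_KWH = 0.15  # assumed rate
    TOKENS_PER_SECOND = 7           # midpoint of the quoted 6-8 tok/s

    seconds = LIFETIME_YEARS * 365 * 24 * 3600
    total_tokens = TOKENS_PER_SECOND * seconds

    energy_kwh = POWER_DRAW_W / 1000 * seconds / 3600
    total_cost = HARDWARE_COST_USD + energy_kwh * ELECTRICITY_USD_PER_KWH

    print(f"~{total_tokens / 1e9:.1f}B tokens over {LIFETIME_YEARS} years")
    print(f"~${total_cost / (total_tokens / 1e6):.2f} per million tokens")

Under those assumptions it works out to single-digit dollars per million tokens running 24/7, and once the hardware is paid off the marginal cost is just electricity, roughly a dollar per million tokens.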
I don't think they specified what they were using for networking, but it was probably Thunderbolt/USB4 networking, which can reach 40 Gbps.
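If anyone wants to check what their link actually delivers, iperf3 is the standard tool, but here's a minimal Python sketch of the same idea; the host argument is whatever link-local IP your Thunderbolt bridge interface gets assigned:

    # Minimal point-to-point throughput test, e.g. over a Thunderbolt/USB4
    # bridge. A sanity check only; use iperf3 for real measurements.
    # usage: python tput.py server   |   python tput.py client <host>
    import socket, sys, time

    PORT = 5001
    CHUNK = 1 << 20  # 1 MiB per send/recv

    def server():
        with socket.create_server(("", PORT)) as srv:
            conn, _ = srv.accept()
            total, start = 0, time.time()
            while (data := conn.recv(CHUNK)):
                total += len(data)
            secs = time.time() - start
            print(f"{total * 8 / secs / 1e9:.2f} Gbps received")

    def client(host, seconds=10):
        buf = b"\x00" * CHUNK
        with socket.create_connection((host, PORT)) as conn:
            end = time.time() + seconds
            while time.time() < end:
                conn.sendall(buf)

    if __name__ == "__main__":
        server() if sys.argv[1] == "server" else client(sys.argv[2])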
From the DeepSeek-V3 technical report:
"For this reason, after careful investigations, we maintain the original precision (e.g., BF16 or FP32) for the following components: the embedding module, the output head, MoE gating modules, normalization operators, and attention operators...To further guarantee numerical stability, we store the master weights, weight gradients, and optimizer states in higher precision. "
I’m hoping NVIDIA comes out with their new consumer computer soon!
Still interesting though.
How many additional nuclear power plants will need to be built because even these incredible technical achievements are, under the hood, morons? XD