GPUs are an evolving target. New GPUs have tensor cores and support all kinds of interesting numeric formats, while older GPUs don't support any of the formats that AI workloads are using today (e.g. BF16, int4, and the various smaller FP types).
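To illustrate why BF16 in particular is cheap for hardware to add: it is just float32 with the low 16 mantissa bits dropped, so it keeps float32's exponent range at half the storage. A minimal pure-Python sketch of that truncation (the function names are mine, not from any library):

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    """Return the 16-bit BF16 pattern for x: the top half of its float32 encoding."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # BF16 keeps the sign bit, the full 8-bit exponent, and 7 mantissa bits.
    return bits >> 16

def bf16_to_f32(bits16: int) -> float:
    """Widen a BF16 pattern back to float32 by zero-filling the dropped mantissa bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", bits16 << 16))
    return x

# BF16 keeps float32's dynamic range but only ~2-3 decimal digits of precision:
print(bf16_to_f32(f32_to_bf16_bits(3.14159)))  # -> 3.140625
print(bf16_to_f32(f32_to_bf16_bits(1e38)))     # a large value survives, unlike in FP16
```

The trade-off visible here is why older GPUs can't trivially emulate it: the format only pays off when the truncated multiply-accumulate is wired into the datapath, not simulated in software.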
An NPU will be more efficient because it is much less general than a GPU and doesn't spend any gates on graphics. However, it is also fairly restricted. Cloud hardware is orders of magnitude faster (due to much higher compute resources and I/O bandwidth), e.g. https://cloud.google.com/tpu/docs/v6e.
This unfortunate naming has sown plenty of confusion around DeepSeek's quality and resource requirements. The actual DeepSeek v3/R1 continues to require at least ~100GB of VRAM/Mem/SSD, and this does not change that.
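The ~100GB floor falls out of simple arithmetic on the weight count alone. A back-of-envelope sketch, assuming the ~671B total parameters reported for DeepSeek V3 (that figure, and the quantization levels chosen, are assumptions for illustration; KV cache and activations would come on top):

```python
# Rough weight-only memory estimate, assuming ~671B total parameters
# (DeepSeek V3's reported size; treated as an assumption here).
PARAMS = 671e9

def weight_memory_gb(params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4, 1.58):  # BF16, int8, int4, and an aggressive ~1.58-bit quant
    print(f"{bits:>5} bits/weight -> {weight_memory_gb(PARAMS, bits):8.0f} GB")
```

Even at an aggressive sub-2-bit quantization this lands around 130GB of weights, which is why no renaming of smaller distilled models changes the real resource requirement.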