Run DeepSeek R1 Dynamic 1.58-bit (opens in new tab)

(unsloth.ai)

767 pointsnoch1y ago332 comments

332 comments

154 comments · 33 top-level

apples_oranges1y ago· 31 in thread

Random observation 1: I was running DeepSeek yesterday on my Linux with a RTX 4090 and I noticed that the models should fit into VRAM, which is 24GB. Or they are simply slow. So the Apple shared memory architecture has an advantage here. A 192GB Mx Ultra can load and process large models efficiently.

Random observation 2: It's time to cancel the OpenAI subscription.

yobid201y ago

I canceled my OpenAI subscription last night, as did many many others. There were some threads in reddit with everyone chiming in they all just canceled too. imo OpenAI is done, and will go through massive cuts and probably acquired by the end of the year for a very tiny fraction of its current value.

sailingparrot1y ago

You want to bet? The panic around deepseek is getting completely disconnected from reality.

Don’t get me wrong what DS did is great, but anyone thinking this reshape the fundamental trend of scaling laws and make compute irrelevant is dead wrong. I’m sure OpenAI doesn’t really enjoy the PR right now, but guess what OpenAI/Google/Meta/Anthropic can do if you give them a recipe for 11x more efficient training ? They can scale it to their 100k GPUs clusters and still blow everything. This will be textbook Jevons paradox.

Compute is still king and OpenAI has worked on their training platform longer than anyone.

Of course as soon as the next best model is released, we can train on its output and catch up at a fraction of the cost, and thus the infinite bunny hopping will continue.

But OpenAI is very much alive.

7 more replies

generalizations1y ago

In my experience with deepseek and o1, openai's big talk about (and investment into) hallucination avoidance might save their hides here. Deepseek may be smarter, and understand complex problems better, but it also seems to make mistakes more often. (It's as if it's comprehension is better, but it's worse at memorization/recall.)

Need an LLM to one-shot some complex network scripting? as of last night, o1 is still where its at.

2 more replies

wqaatwt1y ago

IMHO o1 it’s still comparable to a lot better for accomplishing actual stuff than DeepSeek. At least for my use cases.

Of course cost is incomparably higher since plus has a very low limit. Which of course is a huge deal.

conradfr1y ago

Why every time there is a new model all the other competitors are declared immediately dead?

2 more replies

nicce1y ago

1. You can get all the models by buying Kagi subscription (excluding o1). Includes DeepSeek models. You can also feed the assistant with search data that you can filter.

2. If you have GitHub Copilot, you get o1 chat also there.

I haven't seen much value with OpenAI subscription for ages.

1 more reply

kebaman1y ago

Doesn't Microsoft own 49% of OpenAI? They'll end up with it all as a division of Microsoft.

1 more reply

yieldcrv1y ago

I disagree, I don't really need "conversational chat responses", I need multimodal

ChatGPT is the king of the multimodal experience still. Anthropic is a distant second, only because it lets you upload images from the clipboard and responds to them, but it can't do anything else like generate images - sometimes it will do a flowchat which is kind of cool, GPT won't do that - but will it speak to you, have tones, listen to you? no.

And in the open source side, this area has been stagnant for like 18 months. There is no cohesive multimodal experience yet. Just a couple vision models with chat capabilities and pretty pathetic GUIs to support them. You have to still do everything yourself there.

There is a huge utility for me, and many others that dont know it yet, if we could just load a couple models at once that work together seamlessly in a single seamless GUI like how ChatGPT works.

immibis1y ago

The real insult here is graphics card vendors refusing to make ones with more than 24GB for several years now. They do this so you'll have to buy several cards for your AI workstation. Hopefully Apple eating their lunch fixes this.

regularfry1y ago

The 5090 is 32GB out of the box. Not that that's anywhere near the top of what you can do on an Apple, but at least it's movement.

TeMPOraL1y ago

> They do this so you'll have to buy several cards for your AI workstation.

AFAIK you can't do that with newer consumer cards, which is why this became an annoyance. Even a RTX 4070 Ti with its 12 GB would be fine, if you could easily stack a bunch of them like you used to be able with older cards.

1 more reply

therealpygon1y ago

I honestly don’t know why people aren’t more upset by this and still get on their knees for Nvidia. They made the decision specifically to cripple consumer card memory because they didn’t like data centers were using them instead of buying their overpriced enterprise cards that were less performant. They removed NVLink because people were getting better performance out of their two $400 cards than the $1,500 cards Nvidia was trying to peddle. They willfully screw consumers and people love them for it.

1 more reply

sliken1y ago

Or buy 2 Nvidia digits for $6,000 to get 256GB vram.

anakaine1y ago

I disagree with cancelling the OpenAI subscription. I've been getting some help from o1 for both python and php recently, and o1 was doing massively better for the python stuff (it ran, deepseeks didn't and wont with prompt refinement).

neom1y ago

Also for some philosophical stuff DeepSeek just won't do it. I'm working on an essay about spirituality and sometimes it just responds that it doesn't know how to work on those types of problems and we should do something fun like math or games, claud tends to reply with something more like "I have to be honest with you, reincarnation is not real" and ChatGPT doesn't seem to care about that kinda thing at all.

1 more reply

mst1y ago

IIRC thezvi's summary post on R1 mentioned that R1 is amazing for general reasoning and is very clearly a successful proof of concept/capability but a lot of effort seems to have been put into making o1 Good At Code as a practical matter, whereas R1 seems to have been more a research project which proved out the approaches and then was released without sanding the rough edges off because that wasn't the point.

gradus_ad1y ago

Were you running a local model?

nodja1y ago

While 192GB of ram is appealing, it's also quite expensive at $6000. For that price I rather buy a system with 5 used 3090s, which while being "only" 120GB of VRAM, you benefit from much faster tokens/s and prompt processing speed (the macs are notoriously slow at consuming large contexts).

xbmcuser1y ago

I think just getting nvidia Project Digits might be the best option. A lot of people when it was announced were underwhelmed. But I think now it could be just the thing for people making their own ai home servers.

https://www.nvidia.com/en-us/project-digits/

1 more reply

orf1y ago

Can I use that on the train though? I can with a 128GB MacBook, without it sounding like a helicopter taking off as well.

5 more replies

phkahler1y ago

>> While 192GB of ram is appealing, it's also quite expensive at $6000.

That's because it's Apple. It time to start moving to AMD systems with shared memory. My Zen 3 APU system has 64GB these days and its a mini ITX board.

1 more reply

yobid201y ago

The power requirement for 5x5090s is 10x higher , so you'll spend far more than $6000 in electricity over time.

1 more reply

miohtama1y ago

5x 3090 is also much more power hungry?

1 more reply

danielhanchen1y ago

The good thing is since MoEs are mainly memory bound, we just need (VRAM + RAM) to be in the range of 80GB or so in my tests for at least 5 tokens or so /s.

It's better to get (VRAM + RAM) >= 140GB for at least 30 to 40 tokens/s, and if VRAM >= 140GB, then it can approach 140 tokens/s!

Another trick is to accept more than 8 experts per pass - it'll be slower, but might be more accurate. You could even try reducing the # of experts to say 6 or 7 for low FLOP machines!

danielhanchen1y ago

Oh yes 192GB machines should be able these quants (131GB for 1.58bit, 158GB for 1.73bit, 183GB for 2.22bit) well :)

bradfox21y ago

Great release Daniel. Applaud the consistency you have shown.

Can you release slightly bigger quant versions? Would enjoy something that runs well on 8x32 v100 and 8x80 A100.

1 more reply

moffkalast1y ago

Yes, shared memory is a pretty big leg up since it lets the GPU process the whole model even if the bandwidth is slower which still has some benefits.

Apple's M chips, AMD's Strix Point/Halo chips, Intel's Arc iGPUs, Nvidia's Jetsons. The main issue with all of these though is the lack of raw compute to complement the ability to load insanely large models.

ant6n1y ago

So I'm thinking, inference seems mostly memory bound. With a fast CPU (for example 7950x with 16 cores), and 256GB of RAM (seems to be the max), shouldn't that give you plenty of ability to run the largest models (albeit a bit slowly).

It seems that AMD Epyc CPUs support terabytes of ram, some are as cheap as 1000 EUR. why not just run the full R1 model on that - seems that it would be much cheaper than multiple of those insane NVidia-Karten.

throw-qqqqq1y ago

The bottleneck is mainly memory bandwidth. AMD EPYC hw is appealing for local inference because it has a higher memory bandwidth than desktop gear (because 8-12 memory channels vs 2 on almost everything else), but not as fast as the Apple architectures and nowhere near VRAM speeds. If you want to drastically exceed ~3-5 tokens/s on 70b-q4 models, you usually still need GPUs.

3 more replies

immibis1y ago

FWIW Threadrippers go up to 1TB and Threadripper Pro up to 2TB. That's even in the lowest model of each series. (I know this because it happens to be the chip I have. Not saying you shouldn't go for Epyc if it works out better.)

1 more reply

mory911y ago

idk, in my daily work i still see o1 being more useful, did you observe both having the same reasoning power?

Jasondells1y ago· 25 in thread

An 80% size reduction is no joke, and the fact that the 1.58-bit version runs on dual H100s at 140 tokens/s is kind of mind-blowing. That said, I’m still skeptical about how practical this really is for most people. Like, yeah, you can run it on 24GB VRAM or even with just 20GB RAM, but "slow" is an understatement—those speeds would make even the most patient person throw their hands up.

And then there’s the whole repetition issue. Infinite loops with "Pygame’s Pygame’s Pygame’s" kind of defeats the point of quantization if you ask me. Sure, the authors have fixes like adjusting the KV cache or using min_p, but doesn’t that just patch a symptom rather than solve the actual problem? A fried model is still fried, even if it stops repeating itself.

On the flip side, I love that they’re making this accessible on Hugging Face... and the dynamic quantization approach is pretty brilliant. Using 1.58-bit for MoEs and leaving sensitive layers like down_proj at higher precision—super clever. Feels like they’re squeezing every last drop of juice out of the architecture, which is awesome for smaller teams who can’t afford OpenAI-scale hardware.

"accessible" still comes with an asterisk. Like, I get that shared memory architectures like a 192GB Mac Ultra are a big deal, but who’s dropping $6,000+ on that setup? For that price, I’d rather build a rig with used 3090s and get way more bang for my buck (though, yeah, it’d be a power hog). Cool tech—no doubt—but the practicality is still up for debate. Guess we'll see if the next-gen models can address some of these trade-offs.

danielhanchen1y ago

Oh the repetition issue is only on the non dynamic quants :) If you do dynamic quantization and use the 1.58bit dynamic quantized model the repetition issue fully disappears!

Min_p = 0.05 was a way I found to counteract the 1.58bit model generating singular incorrect tokens which happen around 1 token per 8000!

smcleod1y ago

min_p is great, do you apply a small amount of temperate as well?

2 more replies

huijzer1y ago

> That said, I’m still skeptical about how practical this really is for most people.

I'm running Open WebUI for months now for me and some friends as a front-end to one of the API providers (deepinfra in my case, but there are many others, see https://artificialanalysis.ai/).

Having 1.58-bit is very practical for me. I'm looking much forward to the API provider adding this model to their system. They also added a Llama turbo (also quantized) a few months back so I have good hopes.

danielhanchen1y ago

Oh I love Open WebUI as well!! But glad to hear the 1.58bit version could be helpful to you!

rafaelmn1y ago

>Like, I get that shared memory architectures like a 192GB Mac Ultra are a big deal, but who’s dropping $6,000+ on that setup?

AMD strix halo APU will have quad channel memory and will launch soon so expect these kinds of setups available for much less. Apple is charging an arm and a leg for memory upgrades, hopefully we get competition soon. From what I saw at CES OEMs are paying attention to this use case as well - hopefully not following suite on RAM markups.

sliken1y ago

Keep in mind the strix halo APU has a 256 bit wide memory bus and the Mac Ultra has a 1024 bit wide memory bus.

Here's hoping the Nvidia Digit (GB10 chip) has a 512 bit or 1024 bit wide interface, otherwise the Strix Halo will be the best you can do if you don't get the Mac Ultra.

1 more reply

dagmx1y ago

Unfortunately, Apple’s RAM and Storage upgrade prices are very in line with other class comparable OEMs.

I’m sure there’ll be some amount of undercutting but I don’t think it’ll be a huge difference on the RAM side itself.

2 more replies

jairuhme1y ago

At my work, we self-host some models and have found that for anything remotely similar to RAG or use cases that are very specific, the quantized models have proven to be more than sufficient. This helps us keep them running on smaller infra and generally lower costs

michaelt1y ago

Personally I've noticed major changes in performance between different quantisations of the same model.

Mistral's large 123B model works well (but slowly) at 4-bit quantisation, but if I knock it down to 2.5-bit quantisation for speed, performance drops to the point where I'm better off with a 70B 4-bit model.

This makes me reluctant to evaluate new models in heavily quantised forms, as you're measuring the quantisation more than the actual model.

2 more replies

smcleod1y ago

I just ran it up on 48gb (2x 3090) + overflow into CPU RAM and it runs at around 4tk/s (only a little 8k context size though) which while absolutely not something I'd personally use daily - it is actually usable.

eurekin1y ago

I have similar set-up - can you help out with running it? Was it in ollama?

EDIT: It seems that original authors provided a nice write-up:

https://unsloth.ai/blog/deepseekr1-dynamic#:~:text=%F0%9F%96...

1 more reply

ryao1y ago

Which one did you run?

elorant1y ago

Not everyone needs the largest model. There are variations or R1 with fewer parameters that can easily run on consumer hardware. With 80% size reduction you could run 70B on 8-bit on an RTX 3090.

Other than that, if you really need the big one you can get six 3090s and you're good to go. It's not cheap, but you're running a ChatGPT equivalent model from your basement. A year ago this was a wetdream for most enthusiasts.

whimsicalism1y ago

There’s a huge difference both in capabilities and in meaning between “variations of r1” and “r1 distill”. ollama is intentionally misleading people on this but the distills are much much worse

1 more reply

Kye1y ago

I ran whatever version Ollama downloaded on a 3070ti (laptop version). It's reasonably fast. Generative stuff can get weird if you do prompts like "in the style of" or "a new episode of" because it doesn't seem to have much pop culture in its training data. It knows the Stargate movie, for example, and seems to have the IMDB info for the series, but goes absolutely ham trying to summarize the series.

This line in the stuff inside the <think> section suggests it's also been trained on YouTube clips:

>> "I'm not entirely sure if I got all the details right, but this is what I remember from watching clips and summaries online."

An excerpt from the generated summary:

>> "Set in the 23rd century during a Z-Corp invasion, the series features action sequences, strategic thinking, and humor. It explores themes of international espionage, space warfare, and humanity's role in the cosmos. The show incorporates musical numbers and catchy theme songs for an engaging viewing experience. The plot involves investigating alien warships and their secret base on Kessari planet while addressing personal conflicts and philosophical questions about space."

"It explores themes of international espionage, space warfare, and humanity's role in the cosmos" is the closest to correct line in the whole output.

2 more replies

brookst1y ago

Or if you want a large model but don’t need high performance, get a Mac with 128GB UMA.

1 more reply

F7F7F71y ago

People would only be 'throwing their hands up' because commercial LLMs have set unreasonable expectations for folks.

Anyone who has a/the need for or understands the value of a local LLM would be OK with this kind of output.

bnchrch1y ago

Everyone has the need for on device LLM, if the response rate was fast!

2 more replies

goosejuice1y ago

I use commercial LLMs every day. The best of them can still be infuriating at times to the point of being unproductive. So I'm not sure I agree here.

ricardobeat1y ago

The repetition issue happens on simple quantization, what they are releasing is an approach that fixes that.

danielhanchen1y ago

Yes exactly! I edited the blog post to make the wording a bit better!

JKCalhoun1y ago

Layman here — but I am hopeful for 1.58 bit plus custom silicon to be the Holy Grail. I suppose I am setting high expectations on Apple to integrate said in their next "A" chip.

Wishful thinking.

danielhanchen1y ago

Ye a custom chip would be insane! 1.5 bit with a scaling factor seems to be actually usable for MoEs with shared experts!

sliken1y ago

I do want a 192GB Mac Ultra, I'm hoping the Nvidia Digit achieves similar at $3,000. Sadly no specifications or benchmarks, so tokens/sec is just a guess at this point.

yodsanklai1y ago

> I’d rather build a rig with used 3090s and get way more bang for my buck

I'm curious, what would you use that rig for?

brap1y ago· 12 in thread

As someone who is out of the loop, what’s the verdict on R1? Was anyone able to reproduce the results yet? Is the claim that it only took $5M to train generally accepted?

It’s a very bold claim which is really shaking up the markets, so I can’t help but wonder if it was even verified at this point.

huijzer1y ago

> Is the claim that it only took $5M to train generally accepted?

Based on Nvidia being down 18% yesterday I would say the claim is generally accepted.

willsmith721y ago

Because the markets are rational, all-knowing, and have never been wrong?

3 more replies

tarruda1y ago

It is still unconfirmed since no one outside of deepseek reproduced it.

If confirmed, Nvidia could go down even more

2 more replies

infecto1y ago

While Deepseek was an instigator in the price movements I would not say its accepted.

afavour1y ago

I don’t see them as related. The market moves when there is money to be made. It’s only tangentially related to any kind of general sentiment.

“I don’t believe this, but I know others will, so I’m selling”

deskamess1y ago

> Nvidia being down 18%

The only part of DeepSeek-R1 I do not like. I hope it's over, but I am not holding my breath.

1 more reply

Kye1y ago

Huggingface is working on reproducing it: https://github.com/huggingface/open-r1

ryao1y ago

They claimed that it only took $5 million of GPUs to train Deepseek v3, which was the base model. They did not claim that the total costs were $5 million. They omitted the costs of additional hardware, electricity, personnel, training dataset acquisition, etcetera. They likely spent tens of times more on this at a minimum.

That said, what they did with $5 million of GPUs is impressive. Reportedly, they resorted to using PTX assembly to make it possible:

https://www.tomshardware.com/tech-industry/artificial-intell...

infecto1y ago

I think the jury is out. With folks trying to replicate the process we will see if the low budget is true or not. I am still on the fence, there was comments from Scale CEO that they have a huge number of H100s they used. On the market side I think regardless if this was true or not, this gave people the opportunity to sell what is perhaps overinflated valuations.

dinosaurdynasty1y ago

That's likely only the marginal cost of training this model, and doesn't include a lot of other costs, like the datacenters and GPUs themselves which they already had and also the staff.

If they aren't lying because they have hardware they're not supposed to have, which is also a possibility.

whimsicalism1y ago

these claims are getting more wrong every time i see them, weird game of telephone going around tech circles.

the cost absolutely includes the cost of GPUs and data centers, they quoted a standard price for renting h800 which has all of this built in. but yes, as very explicitly noted in the paper, it does not include cost of test iterations

whimsicalism1y ago

r1 probably cost way less to train, $5m is the alleged price tag for dsv3

bluesounddirect1y ago· 8 in thread

Hi small comment, please remember in china many things are sponsored by or subsidized by the government. "We[china] can do it for less.." , "it's cheaper in china.." only means the government gave us a pile of cash and help to get here .

I 100% expect some downvotes from the ccp.

kccqzy1y ago

And the United States subsidizes plenty of things too. For example the CHIPS act has $39 billion in subsidies for chip manufacturing on U.S. soil. There's nothing wrong with either country's subsidies. I personally don't believe in maximum free market. Government subsidy is more often than not a good thing and we need more of them both here and in China.

tivert1y ago

> Hi small comment, please remember in china many things are sponsored by or subsidized by the government. "We[china] can do it for less.." , "it's cheaper in china.." only means the government gave us a pile of cash and help to get here .

And that's a really important strategic advantage China has versus America, which has such an insane fixation on pure(ish) free markets and free trade that it gives away its advantages in strategic industry after strategic industry.

Some people falsely infer from the experience with the Soviet Union that freer markets always win geopolitical competition, but that's false.

syndicatedjelly1y ago

It’s false except for every time that it has been true

cynicalpeace1y ago

> And that's a really important strategic advantage China has versus America, which has such an insane fixation on pure(ish) free markets and free trade that it gives away its advantages in strategic industry after strategic industry.

> Some people falsely infer from the experience with the Soviet Union that freer markets always win geopolitical competition, but that's false.

The data we have is 500 years of free markets in the western world and the verdict is overwhelmingly: Yes, more freedom means more winning.

Just invite some incompetent bureaucrat over your house to dictate how you should cook and you'll quickly agree.

1 more reply

bluesounddirect1y ago

https://apnews.com/article/deepseek-china-generative-ai-inte... TA

blacklightpy1y ago

It's just the way it should be.

lucb1e1y ago

> I 100% expect some downvotes from the ccp.

Always happy to oblige when someone insinuates that any critics must be government agents

bluesounddirect1y ago

Naa just wanted to see who would bite.

raghavbali1y ago· 6 in thread

> Unfortunately if you naively quantize all layers to 1.58bit, you will get infinite repetitions in seed 3407: “Colours with dark Colours with dark Colours with dark Colours with dark Colours with dark” or in seed 3408: “Set up the Pygame's Pygame display with a Pygame's Pygame's Pygame's Pygame's Pygame's Pygame's Pygame's Pygame's Pygame's”.

This is really interesting insight (although other works cover this as well). I am particularly amused by the process by which the authors of this blog post arrived at these particular seeds. Good work nonetheless!

danielhanchen1y ago

Hey! :) Coincidentally the seeds I always use are 3407, 3408 and 3409 :) 3407 because of https://arxiv.org/abs/2109.08203

I also tried not setting the seeds, but the results are still the same - quantizing all layers seems to make the model forget and repeat everything - I put all examples here: https://docs.unsloth.ai/basics/deepseek-r1-dynamic-1.58-bit#...

iamnotagenius1y ago

would be great to have dynamic quants of V3-non-R1 version, as for some tasks it is good enough. Also would be very interesting to see degradation with dynamic quants on small/medium size MoEs, such as older Deepseek models, Mixtrals, IBM tiny Granite MoE. Would be fun if Granite 1b MoE will still be functioning at 1.58bit.

1 more reply

littlestymaar1y ago

Can't this kind of repetition be dealt with at the ~~decoder~~ (edit: sampler) level, like for any models? (see DRY ~~decoder~~ sampler for instance: https://github.com/oobabooga/text-generation-webui/pull/5677)

danielhanchen1y ago

Oh yes one could provide a repetition penalty for example - the issue is it's not just repetition that's the issue. I find it rather forgets what it already saw, and so hence it repeats stuff - it's probably best to backtrack, then delete the last few rows in the KV cache.

Another option is to employ min_p = 0.05 to force the model not to generate low prob tokens - it can help especially in the case when the 1.58bit model generates on average 1/8000 tokens or so an "incorrect" token (for eg `score := 0`)

reichardt1y ago

You likely mean sampler, not decoder. And no, the stronger the quantization, the more the output token probabilities diverge from the non-quantized model. With a sampler you can't recover any meaningful accuracy. If you force the sampler to select tokens that won't repeat, you're just trading repetitive gibberish for non-repetitive gibberish.

1 more reply

ErikBjare1y ago

You can deal with this through various sampling methods, but it doesn't actually fix the fried model.

mtrovo1y ago· 5 in thread

Wow, an 80% reduction in size for DeepSeek-R1 is just amazing! It's fantastic to see such large models becoming more accessible to those of us who don't have access to top-tier hardware. This kind of optimization opens up so many possibilities for experimenting at home.

I'm impressed by the 140 tokens per second speed with the 1.58-bit quantization running on dual H100s. That kind of performance makes the model practical for small or mid sized shops to use it for local applications. This is a huge win for people working on agents that require low latency that only local models could support.

paradite1y ago

Btw completely off topic, but your comment triggered the internal classification in my brain, and it looks like AI-generated.

Not accusing you anything. Could be that you happen to write in a way similar to LLMs. Could be that we are influenced by LLM writing styles and are writing more and more like LLMs. Could be that the difference between LLM generated content and human-generated content is getting smaller and harder to tell.

j_bum1y ago

+1 my LLM spidy senses were tingling.

It’s the exclamation point in the first paragraph, the concise and consistent sentence structure, and the lack of colloquial tone.

OP, no worries if you’re real. I often read my own messages or writing and worry that people will think I’m an LLM too.

1 more reply

mtrovo1y ago

haha you got me. I'm real person using LLM to proofread the stuff I write. English is not my native language and I'm trying to improve my written vocabulary a little bit. Sorry if it reads a little bit too off.

1 more reply

ahmeneeroe-v21y ago

Very funny, I didn't mentally jump to LLM, but the language was so lifeless that I stopped reading.

Amazing that OP confirmed you're correct (and good use of LLM @OP).

danielhanchen1y ago

I was pleasantly surprised by 140 tokens/s as well! I literally thought I did something wrong but it was real!

ThePhysicist1y ago· 5 in thread

In general, how do you run these big models on cloud hardware? Do you cut them up layer-wise and run slices of layers on individual A100/H100s?

phire1y ago

My understanding is with MoE (Mixture of Experts), you can and should shard it horizontally. The whole model is 600GB, but only 37GB is active during the evaluation of any single output token.

So you can load a different active subset of the MoE into each 89GB GPU, sharding it across something like 32 different GPUs (or can you get away with less? Wouldn't be surprised if they can infer on 8x H800 gpus). Some parameters are common, others are independent. Queries can be dynamically routed between GPUs, potentially bouncing between GPUs as much as once per output token, depending on which experts they need to activate.

Though, I suspect it's normal to stick on one MoE subset for several output tokens.

This has a secondary benefit that as long as the routing distribution is random, queries should be roughly load balanced across all GPUs.

yorwba1y ago

Each MoE layer has its own router, and it activates 8 (out of 256) experts at a time. There's no reason to expect all of them to stay on the same GPU, so you're pretty much guaranteed to have to do all-to-all communication between the GPUs in your cluster after every layer for every token.

1 more reply

danielhanchen1y ago

There are a few ways - the most basic is per layer sharding - DeepSeek uses 3 dense layers, so that can stay on GPU0 (with the embedding layer). There's 58 MoE layers (256 experts, 8 activated) and 1 shared expert per layer. GPU1 would house layers 3 to 9, and so on.

Then by using pipeline parallelism, if a new request comes, we simply stick them in a queue - GPUs 0, 1, 2, ..., 8. Request A is at GPU 2, Request B at GPU 1, Request C at GPU 0 and so on.

The other option is tensor parallelism were we split the weights evenly. You could combine pipeline and tensor parallelism as well!

amelius1y ago

You could do that, and add pipelining to improve speed.

teekert1y ago

Was wondering the same, but for HPC clusters :)

tarruda1y ago· 3 in thread

Would be great if the next generation of base models was designed to be inferred with 128GB of VRAM while 8bit quantized (which would fit in the consumer hardware class).

For example, I imagine a strong MoE base with 16 billion active parameters and 6 or 7 experts would keep a good performance while being possible to run on 128GB RAM macbooks.

danielhanchen1y ago

So I remember Deepseek used float8 for training - Character AI also used int8 for training - it is indeed possible, but sometimes training can be unstable - Deepseek to my knowledge is actually the first lab to use float8 at a large scale without causing loss spikes - they used FP8 tensor cores, then every 4th matrix multiply, they accumulated to a FP32 accumulator - it seems like the Hopper Tensor Cores accumulation mechanism might not be actual FP32 accumulation. I wrote more here: https://x.com/danielhanchen/status/1872719599029850391

Davidzheng1y ago

Would be great, but unfortunately i think intelligence at that compute scale will be limit by hardware not its model. Though at hardware limit I would expect it to be roughly human level especially if optimized for a particular domain.

tarruda1y ago

I remember that Llama 3 was trained on data curated by Llama 2 and it resulted in a model with a significant performance boost (even though it was trained by a previous generation model of the same size).

Maybe using a strong reasoning model such as R1 the next generation, even more performance can be extracted from smaller models.

1 more reply

miohtama1y ago· 3 in thread

Flappy Bird in Python is the new Turing test

danielhanchen1y ago

:) It's my goto test :) I did amp it up by adding 10 conditions and made a scoring card - I found the original R1 to sometimes forget "import os" or miss some lines as well, so I thought it was at least a good check!

I also like to ask the models to create a simple basic Minecraft type game where you can break pieces and store them in your inventory, but disallow building stuff

miohtama1y ago

I feel any AI can fix those problems when they can finally act. The problem AIs cannot run or debug code, or even book a hotel for me. When that is solved and an AI can interact with the code like a human does, it can fix its problems like a human does.

1 more reply

mclau1561y ago

hopefully we eventually push them to make more classic games like motherlode

xiphias21y ago· 3 in thread

Has it been tried on 128GB M4 MacBook Pro? I'm gonna try it, but I guess it will be too slow to be usable.

I love the original DeepSeek model, but the distilled versions are too dumb usually. I'm excited to try my own queries on it.

rahimnathwani1y ago

  I love the original DeepSeek model, but the distilled versions are too dumb usually.

Apart from being dumber, they also don't know as much as R1. I can see how fine-tuning can improve reasoning capability (by showing examples of good CoT) but there's no reason that would improve the knowledge of facts (relative to the Qwen or Llama model on which the finetuning was based).

emseetech1y ago

I'm downloading it now and will report back.

(I've been using the 32B and while it could always be better, I'm not unhappy with it)

TheTaytay1y ago

How'd it go, and which client are you using? :)

1 more reply

sylware1y ago· 3 in thread

site is javascript walled

80%? On 2 H100 only? To get near chatgpt 4? Seriously? The 671B version??

whimsicalism1y ago

they have not benchmarked the quantized model.

fsflover1y ago

> site is javascript walled

I use Qubes OS to protect myself from the JS.

sylware1y ago

That site should work with a noscript/basic (x)html browser.

DogRunner1y ago· 2 in thread

>For optimal performance, we recommend the sum of VRAM + RAM to be at least 80GB+.

Oh nice! So I can try it in my local "low power/low cost" server at home.

My homesystem does run in a ryzen 5500 + 64gb RAM + 7x RTX 3060 12gb

So 64gb RAM plus 84gb VRAM

I dont want to brag around, but point to solutions for us tinkerers with a small budget and high energy costs.

such system can be build for around 1600 euro. The power consumption is around 520 watt.

I started with a AM4 Board (b450 Chipset) and one used RTX 3060 12gb which cost around 200 Euro used if you are patient.

There every additional GPU is connected with the pcie riser/extender to give the cards enough space.

After a while I had replaces the pcie cards with a single pcie x4 to 6x PCIe x1 extender.

It runs pretty nice. Awesome to learn and gain experience

tucnak1y ago

How are you arriving at those numbers?

ryzen 5500 + 7x3060 + cooling ~= 1.6 kW off the wall, at 360 GB/s memory bandwidth, and considering your lane budget, most of it will be wasted in single PCIe lanes. After-market unit price of 3060's is 200 eur, so 1600 is not good-faith cost estimate.

From the looks of it, your setup is neither low-power, nor low-cost. You'd be better served with a refurbished mac studio (2022) at 400GB/s bandwidth fully utilised over 96 GB memory. Yes, it will cost you 50% more (considering real cost of such system closer to 2000 eur) however it would run at a fraction of power use (10x less, more or less)

I get it that hobbyists like to build PC's, but claiming that sticking seven five year out of date low-bandwidth GPU's in a box is "low power/low cost" is a silly proposition.

You're advocating for e-waste

benjiro1y ago

The issue is that you are taking max GPU power draw, as a given. Running a LLM does not tax a GPU the same way a game does. There is a rather know Youtuber, that ran LLMs on a 4090, and the actual power draw was only 130W on the GPU.

Now add that this guy has 7x3060 = 100% miner. So you know that he is running a optimized profile (underclocked).

Fyi, my gaming 6800 draws 230W, but with a bit of undervolting and sacrificing 7% performance, it runs at 110W for the exact same load. And that is 100% taxed. This is just a simple example to show that a lot of PC hardware runs very much overclocked/unoptimized out of the box.

Somebody getting down to 520W sounds perfectly normal, for a undervolted card that gives up maybe 10% performance, for big gains in power draw.

And no, old hardware can be extreme useful in the right hands. Add to this, its the main factor that influences the speed tends to be more memory usage (the more you can fit and the interconnects), then actual processing performance for running a LLM.

Being able to run a large model for 1600 sounds like a bargain to me. Also, remember, when your not querying the models, the power will be mostly the memory wakes + power regulators. Coming back to that youtuber, he was not constantly drawing that 130W, it was only with spikes when he ran prompts or did activity.

Yes, running from home will be more expensive then a 10$ copilot plan but ... nobody is also looking at your data ;)

2 more replies

cubefox1y ago· 2 in thread

For anyone wondering why "1.58" bits: 2^1.58496... = 3. The weights have one of the three states {-1, 0, 1}.

dist-epoch1y ago

They say something else:

> We managed to selectively quantize certain layers to higher bits (like 4bit), and leave most MoE layers (like those used in GPT-4) to 1.5bit

cubefox1y ago

That was just improper rounding from 1.58 to 1.5. They say 1.58 in other places and explicitly link to https://arxiv.org/abs/2402.17764

amusingimpala751y ago· 2 in thread

> DeepSeek-R1 has been making waves recently by rivaling OpenAI's O1 reasoning model while being fully open-source.

Do we finally have a model with access to the training architecture and training data set, or are we still calling non-reproducible binary blobs without source form open-source?

stackedinserter1y ago

It sounds like if they owe you the training architecture and training data set.

chris_pie1y ago

It absolutely doesn't. It sounds like further diluting the term "open-source" isn't great.

1 more reply

ggm1y ago· 2 in thread

If I invested in a 100x machine because I needed 100 of x to run, and somebody shows how 10x can work, why have I not just become the holder of 10 10x machines, and therefore have already achieved capex to exploit this new market?

I cannot understand why "openai is dead" has legs: repurpose the hardware and data and it can be multiple instances of the more efficient model.

stevenhuang1y ago

because of discounted cash flow/valuation models.

you invest in a 100x machine expecting a revenue of X, but now you can only charge X/100 because R1 shows that AI inference can be done much more efficiently. see the price decrease of ChatGPT and addition of free O3 etc.

this reduction of future cash flows, ceteris paribus, implies that the present value of these cash flows decrease. this then results in massive repricing to the downside as market participants update their forecasts.

what you are missing is that to assume as you do, you must make the additional assumption that demand for additional compute is infinite. Which may very well be the case, but it is not guaranteed compared to the present realized fact that R1 means lower revenues for AI inference providers -> changes the capex justification for even more hardware -> NVDA receives less revenue.

ggm1y ago

Thanks. Always a mistake to assume the price of something is bound to your own cost of doing it: the price is bound to the other guys cheaper price!

upghost1y ago· 2 in thread

Thanks for the run instructions, unsloth. Deepseek is so new it's been breaking most of my builds.

marcodiego1y ago

This is an important step. Especially for beginners or people who are not in the loop, being able to easily type some simple commands to download, install dependencies, compile and run everything needed for a LLM AI model gives a feeling sci-fi; it's almost like you can have a helping brain at home.

One thing I've being thinking about doing is to combine one of those LLM models running in llama.cpp, feed it with the output of whisper.cpp and connect its output to some TTS model. I wonder how far from Wheels and Roadie from the Pole Position tv series.

danielhanchen1y ago

Glad they were helpful! :)

TheTaytay1y ago· 1 in thread

Danielhanchen, your work is continually impressive. Unsloth is great, and I’m repeatedly amazed at your ability to get up to speed on a new model within hours of its release, and often fix bugs in the default implementation. At this point, I think serious labs should give you a few hour head start just to iron out their kinks!

danielhanchen1y ago

Oh thanks a lot! Appreciate it :) We're always open to collaborating with anyone!

afro881y ago· 1 in thread

The size reduction while keeping the model coherent is incredible. But I'm skeptical of how much effectiveness was retained. Flappy bird is well known and the kind of thing a non-reasoning model could het right. A better test would be something off the beaten path that R1 and o1 get right that other models don't.

whimsicalism1y ago

yeah it is pretty unclear how lobotomized it is without benchmark.

i’ve gotten full fp8 running on 8xh100, probably going to keep doing that

Pxtl1y ago· 1 in thread

Is there any good quick summary of what's special about DeepSeek? I know it's OSS and incredibly efficient, but news laymen are saying it's trained purely on AI info instead of using a corpus of tagged data... which, I assume, means it's somehow extracting weights or metadata or something from other AIs. Is that it?

rahimnathwani1y ago

  Is there any good quick summary of what's special about DeepSeek?

Yes, section 2.3 of the Deepseek R1 paper summarizes the training part you're asking about, in less than a page.

https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSee...

beernet1y ago· 1 in thread

Big fan of unsloth, they have huge potential, could definitely need some experienced GTM people though, IMO. The pricing page and messages sent there are really not good.

danielhanchen1y ago

Oh thanks :) Yes agreed we do need better GTM - temporarily it's still me and my brother running Unsloth, so for now we're just prioritizing many more engineering releases :)

techwiz1371y ago· 1 in thread

How can you have a bit and a half exactly? It doesn't make sense.

dosinga1y ago

It's not a bit and a half. It is 1.58 or really log(3) / log(2) since it allows for three values, -1, 0 an 1

CodeCompost1y ago· 1 in thread

Can I run this on ollama?

benoitg1y ago

Yes, the instructions are in the OP.

petesergeant1y ago· 1 in thread

It is going to be truly fucking revolutionary if open-source models are and continue to be able to challenge the state of the art. My big philosophical concern is that AI locks Capital into an absolutely supreme and insurmountable lead over Labour, and into the hands of oligarchs, and the possibility of a future where that's not case feels amazing. It pleases me greatly that this has Trump riled up too, because I think it means he's much less likely to allow existing US model-makers to build moats, as I think he's -- even as a man who I don't think believes in very much -- absolutely unwilling to let the Chinese get the drop on him over this.

fullstackchris1y ago

I have no doubt open source will catch up (it already has, eh?) at the end of the day, it's just creative / new iterations on what is ultimately the transformer architecture... the amount of "secret" moat-like stuff that OpenAI was doing was bound to be figured out or exceeded eventually, like everything in tech...

Not to make fun of OpenAI and the great work they've done but it's kinda like if I went out in the 90s and said I'm going to found a company to have the best REST APIs. You can always found a successful tech company, but you can't found a successful tech company on a technological architecture or pattern alone.

hendersoon1y ago

The size reduction is impressive but unless I missed it, they don't list any standard benchmarks for comparison so we have no way to tell how it compares to the full-size model.

Dwedit1y ago

Is this actually 1.58 bits? (Log base 2 of 3) I heard of another "1.58 bit" model that actually used 2 bits instead. "1.6 bit" is easy enough, you can pack five 3-state values into a byte by using values 0-242. Then unpacking is easy, you divide and modulo by 3 up to five times (or use a lookup table).

danesparza1y ago

Just ask it about Taiwan (not kidding). I'm not sure I can trust a model that has such a focused political agenda.

MyFirstSass1y ago

Is this akin to the quants already being done to various models when you download a GGUF at 4 bits for example, or is this variable layer compression something new that can also be make existing smaller models smaller so we can fit more into say 12 or 16 gb's of vram?

slewis1y ago

It would be really useful to see these evaluated across some of the same evals that the original R1 and deepseek's distills were evaluated on.

patleeman1y ago

Incredible work by the Unsloth brothers again. It’s really cool to see bitnet quantization implemented like this.

CHB04030854821y ago

DeepSeek R1 in a nutshell

youtube.com/watch?v=Nl7aCUsWykg

indigodaddy1y ago

Is there any small DS or qwen model that could run on say an M4 Mac Mini Standard (16G) ?

mclau1561y ago

Is the new LLM benchmark to create flappy bird in pygame?

homarp1y ago

j / k navigate · click thread line to collapse

332 comments

154 comments · 33 top-level

apples_oranges1y ago· 31 in thread

Random observation 2: It's time to cancel the OpenAI subscription.

yobid201y ago

sailingparrot1y ago

You want to bet? The panic around deepseek is getting completely disconnected from reality.

Compute is still king and OpenAI has worked on their training platform longer than anyone.

Of course as soon as the next best model is released, we can train on its output and catch up at a fraction of the cost, and thus the infinite bunny hopping will continue.

But OpenAI is very much alive.

7 more replies

generalizations1y ago

Need an LLM to one-shot some complex network scripting? as of last night, o1 is still where its at.

2 more replies

wqaatwt1y ago

IMHO o1 it’s still comparable to a lot better for accomplishing actual stuff than DeepSeek. At least for my use cases.

Of course cost is incomparably higher since plus has a very low limit. Which of course is a huge deal.

conradfr1y ago

Why every time there is a new model all the other competitors are declared immediately dead?

2 more replies

nicce1y ago

1. You can get all the models by buying Kagi subscription (excluding o1). Includes DeepSeek models. You can also feed the assistant with search data that you can filter.

2. If you have GitHub Copilot, you get o1 chat also there.

I haven't seen much value with OpenAI subscription for ages.

1 more reply

kebaman1y ago

Doesn't Microsoft own 49% of OpenAI? They'll end up with it all as a division of Microsoft.

1 more reply

yieldcrv1y ago

I disagree, I don't really need "conversational chat responses", I need multimodal

There is a huge utility for me, and many others that dont know it yet, if we could just load a couple models at once that work together seamlessly in a single seamless GUI like how ChatGPT works.

immibis1y ago

regularfry1y ago

The 5090 is 32GB out of the box. Not that that's anywhere near the top of what you can do on an Apple, but at least it's movement.

TeMPOraL1y ago

> They do this so you'll have to buy several cards for your AI workstation.

1 more reply

therealpygon1y ago

1 more reply

sliken1y ago

Or buy 2 Nvidia digits for $6,000 to get 256GB vram.

anakaine1y ago

neom1y ago

1 more reply

mst1y ago

gradus_ad1y ago

Were you running a local model?

nodja1y ago

xbmcuser1y ago

https://www.nvidia.com/en-us/project-digits/

1 more reply

orf1y ago

Can I use that on the train though? I can with a 128GB MacBook, without it sounding like a helicopter taking off as well.

5 more replies

phkahler1y ago

>> While 192GB of ram is appealing, it's also quite expensive at $6000.

That's because it's Apple. It time to start moving to AMD systems with shared memory. My Zen 3 APU system has 64GB these days and its a mini ITX board.

1 more reply

yobid201y ago

The power requirement for 5x5090s is 10x higher , so you'll spend far more than $6000 in electricity over time.

1 more reply

miohtama1y ago

5x 3090 is also much more power hungry?

1 more reply

danielhanchen1y ago

The good thing is since MoEs are mainly memory bound, we just need (VRAM + RAM) to be in the range of 80GB or so in my tests for at least 5 tokens or so /s.

It's better to get (VRAM + RAM) >= 140GB for at least 30 to 40 tokens/s, and if VRAM >= 140GB, then it can approach 140 tokens/s!

Another trick is to accept more than 8 experts per pass - it'll be slower, but might be more accurate. You could even try reducing the # of experts to say 6 or 7 for low FLOP machines!

danielhanchen1y ago

Oh yes 192GB machines should be able these quants (131GB for 1.58bit, 158GB for 1.73bit, 183GB for 2.22bit) well :)

bradfox21y ago

Great release Daniel. Applaud the consistency you have shown.

Can you release slightly bigger quant versions? Would enjoy something that runs well on 8x32 v100 and 8x80 A100.

1 more reply

moffkalast1y ago

Yes, shared memory is a pretty big leg up since it lets the GPU process the whole model even if the bandwidth is slower which still has some benefits.

ant6n1y ago

throw-qqqqq1y ago

3 more replies

immibis1y ago

1 more reply

mory911y ago

idk, in my daily work i still see o1 being more useful, did you observe both having the same reasoning power?

Jasondells1y ago· 25 in thread

danielhanchen1y ago

Oh the repetition issue is only on the non dynamic quants :) If you do dynamic quantization and use the 1.58bit dynamic quantized model the repetition issue fully disappears!

Min_p = 0.05 was a way I found to counteract the 1.58bit model generating singular incorrect tokens which happen around 1 token per 8000!

smcleod1y ago

min_p is great, do you apply a small amount of temperate as well?

2 more replies

huijzer1y ago

> That said, I’m still skeptical about how practical this really is for most people.

I'm running Open WebUI for months now for me and some friends as a front-end to one of the API providers (deepinfra in my case, but there are many others, see https://artificialanalysis.ai/).

danielhanchen1y ago

Oh I love Open WebUI as well!! But glad to hear the 1.58bit version could be helpful to you!

rafaelmn1y ago

>Like, I get that shared memory architectures like a 192GB Mac Ultra are a big deal, but who’s dropping $6,000+ on that setup?

sliken1y ago

Keep in mind the strix halo APU has a 256 bit wide memory bus and the Mac Ultra has a 1024 bit wide memory bus.

Here's hoping the Nvidia Digit (GB10 chip) has a 512 bit or 1024 bit wide interface, otherwise the Strix Halo will be the best you can do if you don't get the Mac Ultra.

1 more reply

dagmx1y ago

Unfortunately, Apple’s RAM and Storage upgrade prices are very in line with other class comparable OEMs.

I’m sure there’ll be some amount of undercutting but I don’t think it’ll be a huge difference on the RAM side itself.

2 more replies

jairuhme1y ago

michaelt1y ago

Personally I've noticed major changes in performance between different quantisations of the same model.

This makes me reluctant to evaluate new models in heavily quantised forms, as you're measuring the quantisation more than the actual model.

2 more replies

smcleod1y ago

eurekin1y ago

I have similar set-up - can you help out with running it? Was it in ollama?

EDIT: It seems that original authors provided a nice write-up:

https://unsloth.ai/blog/deepseekr1-dynamic#:~:text=%F0%9F%96...

1 more reply

ryao1y ago

Which one did you run?

elorant1y ago

Not everyone needs the largest model. There are variations or R1 with fewer parameters that can easily run on consumer hardware. With 80% size reduction you could run 70B on 8-bit on an RTX 3090.

whimsicalism1y ago

1 more reply

Kye1y ago

This line in the stuff inside the <think> section suggests it's also been trained on YouTube clips:

>> "I'm not entirely sure if I got all the details right, but this is what I remember from watching clips and summaries online."

An excerpt from the generated summary:

"It explores themes of international espionage, space warfare, and humanity's role in the cosmos" is the closest to correct line in the whole output.

2 more replies

brookst1y ago

Or if you want a large model but don’t need high performance, get a Mac with 128GB UMA.

1 more reply

F7F7F71y ago

People would only be 'throwing their hands up' because commercial LLMs have set unreasonable expectations for folks.

Anyone who has a/the need for or understands the value of a local LLM would be OK with this kind of output.

bnchrch1y ago

Everyone has the need for on device LLM, if the response rate was fast!

2 more replies

goosejuice1y ago

I use commercial LLMs every day. The best of them can still be infuriating at times to the point of being unproductive. So I'm not sure I agree here.

ricardobeat1y ago

The repetition issue happens on simple quantization, what they are releasing is an approach that fixes that.

danielhanchen1y ago

Yes exactly! I edited the blog post to make the wording a bit better!

JKCalhoun1y ago

Layman here — but I am hopeful for 1.58 bit plus custom silicon to be the Holy Grail. I suppose I am setting high expectations on Apple to integrate said in their next "A" chip.

Wishful thinking.

danielhanchen1y ago

Ye a custom chip would be insane! 1.5 bit with a scaling factor seems to be actually usable for MoEs with shared experts!

sliken1y ago

I do want a 192GB Mac Ultra, I'm hoping the Nvidia Digit achieves similar at $3,000. Sadly no specifications or benchmarks, so tokens/sec is just a guess at this point.

yodsanklai1y ago

> I’d rather build a rig with used 3090s and get way more bang for my buck

I'm curious, what would you use that rig for?

brap1y ago· 12 in thread

As someone who is out of the loop, what’s the verdict on R1? Was anyone able to reproduce the results yet? Is the claim that it only took $5M to train generally accepted?

It’s a very bold claim which is really shaking up the markets, so I can’t help but wonder if it was even verified at this point.

huijzer1y ago

> Is the claim that it only took $5M to train generally accepted?

Based on Nvidia being down 18% yesterday I would say the claim is generally accepted.

willsmith721y ago

Because the markets are rational, all-knowing, and have never been wrong?

3 more replies

tarruda1y ago

It is still unconfirmed since no one outside of deepseek reproduced it.

If confirmed, Nvidia could go down even more

2 more replies

infecto1y ago

While Deepseek was an instigator in the price movements I would not say its accepted.

afavour1y ago

I don’t see them as related. The market moves when there is money to be made. It’s only tangentially related to any kind of general sentiment.

“I don’t believe this, but I know others will, so I’m selling”

deskamess1y ago

> Nvidia being down 18%

The only part of DeepSeek-R1 I do not like. I hope it's over, but I am not holding my breath.

1 more reply

Kye1y ago

Huggingface is working on reproducing it: https://github.com/huggingface/open-r1

ryao1y ago

That said, what they did with $5 million of GPUs is impressive. Reportedly, they resorted to using PTX assembly to make it possible:

https://www.tomshardware.com/tech-industry/artificial-intell...

infecto1y ago

dinosaurdynasty1y ago

That's likely only the marginal cost of training this model, and doesn't include a lot of other costs, like the datacenters and GPUs themselves which they already had and also the staff.

If they aren't lying because they have hardware they're not supposed to have, which is also a possibility.

whimsicalism1y ago

these claims are getting more wrong every time i see them, weird game of telephone going around tech circles.

whimsicalism1y ago

r1 probably cost way less to train, $5m is the alleged price tag for dsv3

bluesounddirect1y ago· 8 in thread

I 100% expect some downvotes from the ccp.

kccqzy1y ago

tivert1y ago

Some people falsely infer from the experience with the Soviet Union that freer markets always win geopolitical competition, but that's false.

syndicatedjelly1y ago

It’s false except for every time that it has been true

cynicalpeace1y ago

> Some people falsely infer from the experience with the Soviet Union that freer markets always win geopolitical competition, but that's false.

The data we have is 500 years of free markets in the western world and the verdict is overwhelmingly: Yes, more freedom means more winning.

Just invite some incompetent bureaucrat over your house to dictate how you should cook and you'll quickly agree.

1 more reply

bluesounddirect1y ago

https://apnews.com/article/deepseek-china-generative-ai-inte... TA

blacklightpy1y ago

It's just the way it should be.

lucb1e1y ago

> I 100% expect some downvotes from the ccp.

Always happy to oblige when someone insinuates that any critics must be government agents

bluesounddirect1y ago

Naa just wanted to see who would bite.

raghavbali1y ago· 6 in thread

danielhanchen1y ago

Hey! :) Coincidentally the seeds I always use are 3407, 3408 and 3409 :) 3407 because of https://arxiv.org/abs/2109.08203

iamnotagenius1y ago

1 more reply

littlestymaar1y ago

danielhanchen1y ago

reichardt1y ago

1 more reply

ErikBjare1y ago

You can deal with this through various sampling methods, but it doesn't actually fix the fried model.

mtrovo1y ago· 5 in thread

paradite1y ago

Btw completely off topic, but your comment triggered the internal classification in my brain, and it looks like AI-generated.

j_bum1y ago

+1 my LLM spidy senses were tingling.

It’s the exclamation point in the first paragraph, the concise and consistent sentence structure, and the lack of colloquial tone.

OP, no worries if you’re real. I often read my own messages or writing and worry that people will think I’m an LLM too.

1 more reply

mtrovo1y ago

1 more reply

ahmeneeroe-v21y ago

Very funny, I didn't mentally jump to LLM, but the language was so lifeless that I stopped reading.

Amazing that OP confirmed you're correct (and good use of LLM @OP).

danielhanchen1y ago

I was pleasantly surprised by 140 tokens/s as well! I literally thought I did something wrong but it was real!

ThePhysicist1y ago· 5 in thread

In general, how do you run these big models on cloud hardware? Do you cut them up layer-wise and run slices of layers on individual A100/H100s?

phire1y ago

My understanding is with MoE (Mixture of Experts), you can and should shard it horizontally. The whole model is 600GB, but only 37GB is active during the evaluation of any single output token.

Though, I suspect it's normal to stick on one MoE subset for several output tokens.

This has a secondary benefit that as long as the routing distribution is random, queries should be roughly load balanced across all GPUs.

yorwba1y ago

1 more reply

danielhanchen1y ago

Then by using pipeline parallelism, if a new request comes, we simply stick them in a queue - GPUs 0, 1, 2, ..., 8. Request A is at GPU 2, Request B at GPU 1, Request C at GPU 0 and so on.

The other option is tensor parallelism were we split the weights evenly. You could combine pipeline and tensor parallelism as well!

amelius1y ago

You could do that, and add pipelining to improve speed.

teekert1y ago

Was wondering the same, but for HPC clusters :)

tarruda1y ago· 3 in thread

Would be great if the next generation of base models was designed to be inferred with 128GB of VRAM while 8bit quantized (which would fit in the consumer hardware class).

For example, I imagine a strong MoE base with 16 billion active parameters and 6 or 7 experts would keep a good performance while being possible to run on 128GB RAM macbooks.

danielhanchen1y ago

Davidzheng1y ago

tarruda1y ago

Maybe using a strong reasoning model such as R1 the next generation, even more performance can be extracted from smaller models.

1 more reply

miohtama1y ago· 3 in thread

Flappy Bird in Python is the new Turing test

danielhanchen1y ago

I also like to ask the models to create a simple basic Minecraft type game where you can break pieces and store them in your inventory, but disallow building stuff

miohtama1y ago

1 more reply

mclau1561y ago

hopefully we eventually push them to make more classic games like motherlode

xiphias21y ago· 3 in thread

Has it been tried on 128GB M4 MacBook Pro? I'm gonna try it, but I guess it will be too slow to be usable.

I love the original DeepSeek model, but the distilled versions are too dumb usually. I'm excited to try my own queries on it.

rahimnathwani1y ago

  I love the original DeepSeek model, but the distilled versions are too dumb usually.

emseetech1y ago

I'm downloading it now and will report back.

(I've been using the 32B and while it could always be better, I'm not unhappy with it)

TheTaytay1y ago

How'd it go, and which client are you using? :)

1 more reply

sylware1y ago· 3 in thread

site is javascript walled

80%? On 2 H100 only? To get near chatgpt 4? Seriously? The 671B version??

whimsicalism1y ago

they have not benchmarked the quantized model.

fsflover1y ago

> site is javascript walled

I use Qubes OS to protect myself from the JS.

sylware1y ago

That site should work with a noscript/basic (x)html browser.

DogRunner1y ago· 2 in thread

>For optimal performance, we recommend the sum of VRAM + RAM to be at least 80GB+.

Oh nice! So I can try it in my local "low power/low cost" server at home.

My homesystem does run in a ryzen 5500 + 64gb RAM + 7x RTX 3060 12gb

So 64gb RAM plus 84gb VRAM

I dont want to brag around, but point to solutions for us tinkerers with a small budget and high energy costs.

such system can be build for around 1600 euro. The power consumption is around 520 watt.

I started with a AM4 Board (b450 Chipset) and one used RTX 3060 12gb which cost around 200 Euro used if you are patient.

There every additional GPU is connected with the pcie riser/extender to give the cards enough space.

After a while I had replaces the pcie cards with a single pcie x4 to 6x PCIe x1 extender.

It runs pretty nice. Awesome to learn and gain experience

tucnak1y ago

How are you arriving at those numbers?

I get it that hobbyists like to build PC's, but claiming that sticking seven five year out of date low-bandwidth GPU's in a box is "low power/low cost" is a silly proposition.

You're advocating for e-waste

benjiro1y ago

Now add that this guy has 7x3060 = 100% miner. So you know that he is running a optimized profile (underclocked).

Somebody getting down to 520W sounds perfectly normal, for a undervolted card that gives up maybe 10% performance, for big gains in power draw.

Yes, running from home will be more expensive then a 10$ copilot plan but ... nobody is also looking at your data ;)

2 more replies

cubefox1y ago· 2 in thread

For anyone wondering why "1.58" bits: 2^1.58496... = 3. The weights have one of the three states {-1, 0, 1}.

dist-epoch1y ago

They say something else:

> We managed to selectively quantize certain layers to higher bits (like 4bit), and leave most MoE layers (like those used in GPT-4) to 1.5bit

cubefox1y ago

That was just improper rounding from 1.58 to 1.5. They say 1.58 in other places and explicitly link to https://arxiv.org/abs/2402.17764

amusingimpala751y ago· 2 in thread

> DeepSeek-R1 has been making waves recently by rivaling OpenAI's O1 reasoning model while being fully open-source.

Do we finally have a model with access to the training architecture and training data set, or are we still calling non-reproducible binary blobs without source form open-source?

stackedinserter1y ago

It sounds like if they owe you the training architecture and training data set.

chris_pie1y ago

It absolutely doesn't. It sounds like further diluting the term "open-source" isn't great.

1 more reply

ggm1y ago· 2 in thread

I cannot understand why "openai is dead" has legs: repurpose the hardware and data and it can be multiple instances of the more efficient model.

stevenhuang1y ago

because of discounted cash flow/valuation models.

ggm1y ago

Thanks. Always a mistake to assume the price of something is bound to your own cost of doing it: the price is bound to the other guys cheaper price!

upghost1y ago· 2 in thread

Thanks for the run instructions, unsloth. Deepseek is so new it's been breaking most of my builds.

marcodiego1y ago

danielhanchen1y ago

Glad they were helpful! :)

TheTaytay1y ago· 1 in thread

danielhanchen1y ago

Oh thanks a lot! Appreciate it :) We're always open to collaborating with anyone!

afro881y ago· 1 in thread

whimsicalism1y ago

yeah it is pretty unclear how lobotomized it is without benchmark.

i’ve gotten full fp8 running on 8xh100, probably going to keep doing that

Pxtl1y ago· 1 in thread

rahimnathwani1y ago

  Is there any good quick summary of what's special about DeepSeek?

Yes, section 2.3 of the Deepseek R1 paper summarizes the training part you're asking about, in less than a page.

https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSee...

beernet1y ago· 1 in thread

Big fan of unsloth, they have huge potential, could definitely need some experienced GTM people though, IMO. The pricing page and messages sent there are really not good.

danielhanchen1y ago

Oh thanks :) Yes agreed we do need better GTM - temporarily it's still me and my brother running Unsloth, so for now we're just prioritizing many more engineering releases :)

techwiz1371y ago· 1 in thread

How can you have a bit and a half exactly? It doesn't make sense.

dosinga1y ago

It's not a bit and a half. It is 1.58 or really log(3) / log(2) since it allows for three values, -1, 0 an 1

CodeCompost1y ago· 1 in thread

Can I run this on ollama?

benoitg1y ago

Yes, the instructions are in the OP.

petesergeant1y ago· 1 in thread

fullstackchris1y ago

hendersoon1y ago

The size reduction is impressive but unless I missed it, they don't list any standard benchmarks for comparison so we have no way to tell how it compares to the full-size model.

Dwedit1y ago

danesparza1y ago

Just ask it about Taiwan (not kidding). I'm not sure I can trust a model that has such a focused political agenda.

MyFirstSass1y ago

slewis1y ago

It would be really useful to see these evaluated across some of the same evals that the original R1 and deepseek's distills were evaluated on.

patleeman1y ago

Incredible work by the Unsloth brothers again. It’s really cool to see bitnet quantization implemented like this.

CHB04030854821y ago

DeepSeek R1 in a nutshell

youtube.com/watch?v=Nl7aCUsWykg

indigodaddy1y ago

Is there any small DS or qwen model that could run on say an M4 Mac Mini Standard (16G) ?

mclau1561y ago

Is the new LLM benchmark to create flappy bird in pygame?

homarp1y ago

j / k navigate · click thread line to collapse