https://openrouter.ai/deepseek/deepseek-r1-0528/providers
May 28th update to the original DeepSeek R1. Performance is on par with OpenAI o1, but it's open-sourced, with fully open reasoning tokens. It's 671B parameters in size, with 37B active per inference pass.
Fully open-source model.
I remember there's a project, "Open R1", that last I checked was working on gathering its own training material. It looks active, but I'm not sure how far along they've gotten:
There are a few efforts at fully open data / open weights / open code models, but none of them have reached leading-edge performance.
out of curiosity, does anyone do anything "useful" with that knowledge? it's not like people can just randomly train models..
what you're saying is just that it's non-reproducible, which is a completely valid but separate issue
We have numerous artifacts to reason about:
- The model code
- The training code
- The fine tuning code
- The inference code
- The raw training data
- The processed training data (which might vary across various stages of pre-training and potentially fine-tuning!)
- The resultant weights
- The inference outputs (which also need a license)
- The research papers (hopefully it's described in literature!)
- The patents (or lack thereof)
The term "open source" is wholly inadequate here. We need a 10-star grading system for this.
This is not your mamma's C library.
AFAICT, DeepSeek scores 7/10, which is better than OpenAI's 0/10 (they don't even let you train on the outputs).
This is more than enough to distill new models from.
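That "distill from outputs" workflow amounts to collecting teacher completions into a supervised fine-tuning set. A minimal sketch, where the record format and the example pair are illustrative (adapt to whatever your SFT trainer actually expects):

```python
import json

# Assumed: (prompt, teacher_completion) pairs already collected from a model
# whose license permits training on outputs (e.g. R1, per the comment above).
pairs = [
    ("What is 2+2?", "<think>2 + 2 = 4</think>\nThe answer is 4."),
]

def to_sft_record(prompt, completion):
    # Chat-style record that common fine-tuning stacks accept.
    return {"messages": [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": completion},
    ]}

lines = [json.dumps(to_sft_record(p, c)) for p, c in pairs]
# Write "\n".join(lines) to train.jsonl and point your SFT trainer at it.
```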
Everybody is laundering training data, and it's rife with copyrighted data, PII, and pilfered outputs from other commercial AI systems. Because of that, I don't expect we'll see much legally open training data for some time to come. In fact, the first fully open training data of adequate size (not something like LJSpeech) is likely to be 100% synthetic or robotically-captured.
The companies that create these models can't answer that question! Models get jailbroken all the time to ignore alignment instructions. The robust refusal logic normally sits on top of the model, i.e. looking at the responses and flagging anything they don't want to show to users.
The best tool we have for understanding whether a model is refusing to answer a problem or actually doesn't know is mechanistic interpretability, which you only need the weights for.
This whole debate is weird; even with traditional open-source code you can't tell the intent of a programmer, what sources they used to write that code, etc.
Hugging Face has a leaderboard, and it seems dominated by models that are fine-tunings of various common open-source models, yet they don't seem to be broadly used:
- live benchmarks (livebench, livecodebench, matharena, SWE-rebench, etc)
- benchmarks that do not have a fixed structure, like games or human feedback benches (balrog, videogamebench, arena)
- (to some extent) benchmarks without existing/published answers (putnambench, frontiermath). You could argue that someone could hire people to solve those or pay off the benchmark devs, but it's much more complicated.
Most of the benchmarks that don't try to tackle future contamination are much less useful, that's true. Unfortunately, HLE kind of ignored it (they plan to add a hidden set to test for contamination, but once the answers are there, it's a lost game IMHO); I really liked the concept.
Edit: it is true that these benchmarks are focusing only on a fairly specific subset of the model capabilities. For everything else vibe check is your best bet.
I think you just described SATs and other standardized tests
no idea why they can't just wait a bit to coordinate stuff. bit messy in the news cycle.
it's almost as if they don't care about creating a proper buzz.
I'm working on the new one!
of course we can run any model if we quantize it enough, but I think the OP was talking about the unquantized version.
For the hobby version you would presumably buy a used server and a used GPU. DDR4 ECC RAM can be had for a little over $1/GB, so you could probably build the whole thing for around $2k.
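Rough math on that build, where only the ~$1/GB RAM price comes from the comment above; the figure for the rest of the box is a placeholder guess:

```python
ram_gb = 768                # enough headroom for a ~4-bit quant of R1
ram_cost = ram_gb * 1.10    # DDR4 ECC at "a little over $1/GB"
other_parts = 1000          # used server, GPU, PSU: a rough placeholder
total = ram_cost + other_parts
print(round(total))         # in the ballpark of the $2k figure above
```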
Mobo was some kind of mining rig from AliExpress for less than $100. GPU is an inexpensive NVIDIA TESLA card that I 3D printed a shroud for (added fans). Power supply is a cheap 2000-watt Dell server PSU off eBay....
[1] https://bsky.app/profile/engineersneedart.com/post/3lmg4kiz4...
There's already a 685B parameter DeepSeek V3 for free there.
Truthfully, it's just not worth it. You either run these things so slowly that you're wasting your time, or you have to buy four or five figures' worth of hardware that's going to sit mostly unused.
This guy ran a 4-bit quantized version with 768GB RAM: https://news.ycombinator.com/item?id=42897205
There are a couple of guides for setting it up "manually" on EC2 instances so you're not paying Bedrock's per-token prices. Here's one [1] that calls for four g6e.48xlarge instances (192 vCPUs, 1536GB RAM, and 8x L40S Tensor Core GPUs with 48GB of memory per GPU each).
A quick Google tells me that a g6e.48xlarge is something like $22k USD per month?
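Taking that figure at face value, the four-instance setup from [1] works out to roughly the following (the $22k/month number is from the comment, not a quoted AWS price):

```python
monthly_per_instance = 22_000   # USD, rough figure from the comment above
instances = 4                   # the guide's g6e.48xlarge count
hours_per_month = 730           # average hours in a month

total_monthly = monthly_per_instance * instances
hourly = total_monthly / hours_per_month
print(total_monthly, round(hourly, 2))  # 88000 total, ~120 USD/hour
```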
[0] https://aws.amazon.com/bedrock/deepseek/
[1] https://community.aws/content/2w2T9a1HOICvNCVKVRyVXUxuKff/de...
Software: client of choice to https://openrouter.ai/deepseek/deepseek-r1-0528
Sorry, I'm being cheeky here, but realistically, unless you want to shell out $10k for the equivalent of a Mac Studio with 512GB of RAM, you are best off using other services or a small distilled model based on this one.
If speed is truly not an issue, you can run DeepSeek on pretty much any PC with a large enough swap file, at a speed of about one token every 10 minutes assuming a plain old HDD.
Something more reasonable would be a used server CPU with as many memory channels as possible and DDR4 RAM, for less than $2000.
But before spending big, it might be a good idea to rent a server to get a feel for it.
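A back-of-envelope check on the one-token-every-10-minutes claim, with assumed numbers (R1's ~37B active parameters per token is from the thread; the quantization level and HDD throughput are guesses):

```python
active_params = 37e9        # parameters touched per token in R1's MoE
bytes_per_param = 1.0       # assume ~8-bit quantization
hdd_throughput = 150e6      # bytes/sec, optimistic sequential HDD read

# Worst case, every active weight gets paged in from disk for each token:
seconds_per_token = active_params * bytes_per_param / hdd_throughput
print(round(seconds_per_token))  # ~4 min; random access easily pushes this toward 10
```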
With an average of 3.6 tokens/sec, answers usually take 150-200 seconds.
Whilst the Chinese intelligence agencies won't have much power over you.
Having said that, I'm paranoid too. But if I wasn't they'd have got me by now.
You can, for instance, use them to extract information such as postal codes from strings, or to translate and standardize country names written in various languages (e.g. Spanish, Italian, and French to English), etc.
I'm sure people will have more advanced use cases, but I've found them useful for that.
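A minimal sketch of that extraction pattern against an OpenAI-style chat endpoint (the system prompt and model name are illustrative assumptions; POST the payload to your provider of choice, e.g. openrouter's chat completions endpoint, with your API key):

```python
import json

def build_extraction_request(text, instruction,
                             model="deepseek/deepseek-r1-0528"):
    # Build an OpenAI-style chat payload for a one-shot extraction task.
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Reply with only the extracted value, nothing else."},
            {"role": "user", "content": f"{instruction}\n\n{text}"},
        ],
        "temperature": 0,  # keep extraction output as deterministic as possible
    }

payload = build_extraction_request(
    "Ship to: 10 Downing St, London SW1A 2AA, UK",
    "Extract the postal code from this address.",
)
print(json.dumps(payload)[:60])
```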
[0] https://xcancel.com/glitchphoton/status/1927682018772672950
74% smaller: from 713GB down to 185GB.
Use the magic incantation -ot ".ffn_.*_exps.=CPU" to offload the MoE expert layers to RAM, allowing the non-MoE layers to fit in < 24GB of VRAM at 16K context! The rest sits in RAM & disk.
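For context, that flag belongs to a llama.cpp invocation. A hedged sketch, where only the -ot pattern comes from the post and the model filename and other flags are illustrative:

```shell
# -ot keeps the MoE expert tensors (matched by the regex) in system RAM,
# so the remaining dense layers fit in <24GB of VRAM at 16K context.
# Model path and other flags are examples; adjust to your setup.
./llama-cli \
  -m DeepSeek-R1-0528-IQ1_S.gguf \
  --n-gpu-layers 99 \
  -ot ".ffn_.*_exps.=CPU" \
  -c 16384
```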
This whole “building moats” and buying competitors fascination in the US has gotten boring, obvious and dull. The world benefits when companies struggle to be the best.
edit: most providers are offering a quantized version...