In addition to automated benchmarks, there are also human-rated evaluations, such as Chatbot Arena.
I manually tested DeepSeek v3 against Claude 3.5 Sonnet. In my human evaluation, Claude 3.5 Sonnet outperformed DeepSeek v3, and it also outperforms DeepSeek v3 on SWE-bench. Therefore, the title of the post claiming "DeepSeek v3 beats Claude 3.5 Sonnet and is way cheaper" is wrong.
That said, I was surprised by how well it performed. It's fast. Ironically, I have a paid Claude Team Plan, and while I was conducting the evaluations, Claude was experiencing performance issues (https://status.anthropic.com) and DeepSeek v3 was not. This is telling for the state of chip sale restrictions.
As someone who just follows this stuff from afar, it is hard for me to tell whether this is a SaaS-only model, or whether it means we are getting to the point where you can run an AI model on a local machine.
- To LOAD the model, you need at least 768GB of VRAM, which means 10x H100 GPUs or similar.
- To QUERY the model, it then uses one of the 37GB layers to perform the computation at any given time, which means that each GPU can process 2 queries concurrently (37 * 2 < 80), and the queries are very fast because of this.
So a single-user setup would involve a crazy expensive rack of 10 H100 GPUs that can essentially process 20 concurrent requests almost as quickly as it can process 1 request in single-user mode...
The result is that the model is extremely cheap to operate if served as a SaaS, but ridiculously expensive for a single-user setup.
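For anyone who wants to replay the napkin math above, here is a rough sketch in Python. It only reproduces the arithmetic from this comment (768GB to load, 37GB in use per query, 80GB per H100); the constant and function names are made up for illustration, and it ignores things a real deployment would have to budget for, such as KV cache, activations, and interconnect overhead.

```python
import math

# Figures taken from the comment above; the 80GB value assumes the
# 80GB H100 variant. All names here are hypothetical.
MODEL_VRAM_GB = 768        # VRAM needed to load the model
GPU_VRAM_GB = 80           # VRAM per H100 (assumed 80GB variant)
ACTIVE_GB_PER_QUERY = 37   # memory actively used per query


def gpus_needed(model_gb: int, gpu_gb: int) -> int:
    """Smallest number of GPUs whose combined VRAM holds the model."""
    return math.ceil(model_gb / gpu_gb)


def queries_per_gpu(gpu_gb: int, active_gb: int) -> int:
    """How many 37GB working sets fit in one GPU's VRAM (37 * 2 < 80)."""
    return gpu_gb // active_gb


n_gpus = gpus_needed(MODEL_VRAM_GB, GPU_VRAM_GB)
per_gpu = queries_per_gpu(GPU_VRAM_GB, ACTIVE_GB_PER_QUERY)
total_concurrent = n_gpus * per_gpu

print(f"GPUs needed to load the model: {n_gpus}")
print(f"Concurrent queries per GPU:    {per_gpu}")
print(f"Concurrent queries in total:   {total_concurrent}")
```

With those inputs the rack is sized by the load requirement, not by the per-query cost, which is why the same hardware serves one user or twenty.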
Recommended RAM: more than most PCs have.