DeepSeek v2.5 – open-source LLM comparable to GPT-4, but 95% less expensive

(deepseek.com)

193 points · jchook · 1y ago · 75 comments

joshhart · 1y ago

The benchmarks compare it favorably to GPT-4-turbo but not GPT-4o. The latest versions of GPT-4o are much higher in quality than GPT-4-turbo. The HN title here does not reflect what the article is saying.

That said, the conclusion that it's a good model for the price is true. I would just be hesitant to call it a great model.

A_D_E_P_T · 1y ago

Not only do I completely agree, I've been playing around with both of them for the past 30 minutes and my impression is that GPT-4o is significantly better across the board. It's faster, it's a better writer, it's more insightful, it has a much broader knowledge base, etc.

What's more, DeepSeek doesn't seem capable of handling image uploads. I got an error every time. ("No text extracted from attachment.") It claims to be able to handle images, but it's just not working for me.

When it comes to math, the two seem roughly equivalent.

DeepSeek is, however, politically neutral in an interesting way. Whereas GPT-4o will take strong moral stances, DeepSeek is an impressively blank tool that seems to have no strong opinions of its own. I tested them both on a 1910 article critiquing women's suffrage, asking for a review of the article and a rewritten modernized version; GPT-4o recoiled, DeepSeek treated the task as business as usual.

tkgally · 1y ago

> DeepSeek ... seems to have no strong opinions of its own.

Have you tried asking it about Tibetan sovereignty, the Tiananmen massacre, or the role of the communist party in Chinese society? Chinese models I've tested have had quite strong opinions about such questions.

theanonymousone · 1y ago

Thanks for sharing. How about 4o-mini?

mvdtnz · 1y ago

If OpenAI wants fairer headlines they should use a less stupid version naming convention.

jchook (OP) · 1y ago

I updated the title to say GPT-4, but I believe the quality is still surprisingly close to 4o.

On HumanEval, I see 90.2 for GPT-4o and 89.0 for DeepSeek v2.5.

- https://blog.getbind.co/2024/09/19/deepseek-2-5-how-does-it-...

- https://paperswithcode.com/sota/code-generation-on-humaneval

selfhoster11 · 1y ago

I am extremely sceptical about the claim that any version of GPT-4o meets or exceeds GPT-4 Turbo across the board.

Having used the full GPT-4, GPT-4 Turbo and GPT-4o for text-only tasks, my experience is that this is roughly the order of their capability from most to least capable. In image capabilities, it’s a different story - GPT-4o unquestionably wins there. Not every task is an image task, though.

stefan_ · 1y ago

Begging for the day most comments on a random GPT topic will not be "but the new GPT $X is a total game changer and much higher in quality". Seriously, we went through this with 2, 3, 4... Incremental progress does not a game changer make.

selfhoster11 · 1y ago

I'm sorry, but I gotta defend GPT-4o's image capabilities on this one. It's leagues ahead of the competition on this, even if it's absolutely horrid at text-only tasks.

GaggiX · 1y ago

The table only shows the models that they managed to beat, so there is no GPT-4o or Claude 3.5 Sonnet for example.

uxhacker · 1y ago

It’s interesting to see a Chinese LLM like DeepSeek enter the global stage, particularly given the backdrop of concerns over data security with other Chinese-owned platforms, like TikTok. The key question here is: if DeepSeek becomes widely adopted, will we see a similar wave of scrutiny over data privacy?

With TikTok, concerns arose partly because of its reach and the vast amount of personal information it collects. An LLM like DeepSeek would arguably have even more potential to gather sensitive data, especially as these models can learn from and remember interaction patterns, potentially accessing or “training” on sensitive information users might input without thinking.

The challenge is that we’re not yet certain how much data DeepSeek would retain and where it would be stored. For countries already wary of data leaving their borders or being accessible to foreign governments, we could see restrictions or monitoring mechanisms placed on similar LLMs—especially if companies start using these models in environments where proprietary information is involved.

In short, if DeepSeek or similar Chinese LLMs gain traction, it’s quite likely they’ll face the same level of scrutiny (or more) that we’ve seen with apps like TikTok.

mlyle · 1y ago

An open source LLM that is being used for inference can't "learn from or remember" interaction patterns. It can operate on what's in the context window, and that's it.

As long as the actual packaging is just the model, this is an invalid concern.

Now, of course, if you do inference on anyone else's infrastructure, there's always the concern that they may retain your inputs.

wongarsu · 1y ago

You can run the model yourself, but I wouldn't be surprised if a lot of people prefer the pay-as-you-go cloud offering over spinning up servers with 8 high-end GPUs. It's fair to caution that doing so might mean handing over your data to China.

a2128 · 1y ago

It's usually wildly uneconomical to serve such large models yourself unless you're serving enough users to saturate your hardware. Thus most people will opt for hosted models, and most of the big ones will collect your data for future AI training in exchange for a discounted or free service.

fkyoureadthedoc · 1y ago

Is ChatGPT posting on HN spreading open model FUD!?

> especially as these models can learn from and remember interaction patterns

All joking aside, I'm pretty sure they can't. Sure the hosted service can collect input / output and do nefarious things with it, but the model itself is just a model.

Plus it's open source, you can run it yourself somewhere. For example, I run deepseek-coder-v2:16b with ollama + Continue for tab completion. It's decent quality and I get 70-100 tokens/s.
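
For anyone wanting to try the local setup described above, here is a minimal sketch of querying a locally running Ollama server from Python. It assumes Ollama is listening on its default address (http://localhost:11434) and that the `deepseek-coder-v2:16b` model has already been pulled; the helper names `build_request` and `generate` are made up for this example.

```python
import json
import urllib.request

# Assumption: Ollama's default local endpoint; change it if your server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "deepseek-coder-v2:16b") -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "deepseek-coder-v2:16b") -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Write a bash one-liner that counts lines in *.py files")` returns the completion as a string. The request goes only to localhost, so the prompt never leaves the machine.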

whatarethembits · 1y ago

What hardware are you running this on? I’m interested in trying out local models for programming, and need some pointers on hardware.

kenmacd · 1y ago

For most of the world this is a good argument for being cautious of using US-based AI services (and closed-models) as well.

As someone living in America's Hat, without any protections from PRISM-like programs, and who can't even reach DeepSeek without hopping through the US, it's probably less risky for me to use Chinese LLM services.

jyap · 1y ago

This 236B model came out around September 6th.

DeepSeek-V2.5 is an upgraded version that combines DeepSeek-V2-Chat and DeepSeek-Coder-V2-Instruct.

From: https://huggingface.co/deepseek-ai/DeepSeek-V2.5

genpfault · 1y ago

> To utilize DeepSeek-V2.5 in BF16 format for inference, 80GB*8 GPUs are required.

coconut08 · 1y ago

I wonder if the new mbp can run it at q4.
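
A rough back-of-the-envelope check of the figures in this subthread: the sketch below estimates weight storage alone from the 236B parameter count mentioned above. The bits-per-weight values for q8 and q4 (8 and 4) are simplifying assumptions; real quantization formats carry some per-block overhead, and the KV cache and activations need additional memory on top.

```python
# Rough estimate of the memory needed just to hold the weights.
# Assumptions (not from the model card): 1 GB = 1e9 bytes, q8 = 8 bits/weight,
# q4 = 4 bits/weight, and no overhead for KV cache, activations, or quantization scales.

PARAMS = 236_000_000_000  # 236B parameters, as stated in the thread


def weights_gb(params: int, bits_per_weight: float) -> float:
    """Return the weight storage in GB (1e9 bytes) for a given precision."""
    return params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for name, bits in [("BF16", 16), ("q8", 8), ("q4", 4)]:
        print(f"{name}: ~{weights_gb(PARAMS, bits):.0f} GB")
    # The quoted requirement of 8 GPUs with 80 GB each is 640 GB in total.
    print(f"8 x 80 GB GPUs: {8 * 80} GB")
```

Under these assumptions the BF16 weights alone come to about 472 GB, which fits within the quoted 8 × 80 GB = 640 GB with headroom for the cache. A q4 copy would need roughly 118 GB before overhead.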

DrPhish · 1y ago

I run it at home at q8 on my dual Epyc server. I find it to be quite good, especially when you host it locally and are able to tweak all the settings to get the kind of results you need for a particular task.

rightbyte · 1y ago

I've used it locally too. It is great for some kinds of queries or for writing bash, which I refuse to learn properly.

I really don't want my queries to leave my computer, ever.

It is quite surreal how little hype this 'open weights' model gets.

selfhoster11 · 1y ago

It helps to be able to run the model locally, and currently this is slow or expensive. The challenges of running a local model beyond say 32B are real.

nextworddev · 1y ago

Where are the servers hosted, and is there any proof that the data doesn’t cross overseas to China?

selfhoster11 · 1y ago

Some models include executable code. The solution is to use a runtime that implements native support for this architecture, such that you can disable external code execution. Or to use a weights format that lacks the capability in the first place, like GGUF. Then, it's no different to decoding a Chinese-made MP3 or JPEG - it's safe as long as it doesn't try to exploit vulnerabilities in the runtime, which is rare.

If you want to be absolutely sure, run it within an offline VM with no internet access.

alphan0n · 1y ago

What’s the point of this comment? Anyone who can read knows the answer to this question.

There’s literally no attempt to hide that this is a Chinese company, physically located in China.

It’s clearly stated in their privacy policy [0].

> International Data Transfers

> The personal information we collect from you may be stored on a server located outside of the country where you live. We store the information we collect in secure servers located in the People's Republic of China.

> Where we transfer any personal information out of the country where you live, including for one or more of the purposes as set out in this Policy, we will do so in accordance with the requirements of applicable data protection laws.

[0] https://chat.deepseek.com/downloads/DeepSeek Privacy Policy.html

khanan · 1y ago

Did you try asking it whether Winnie the Pooh looks like the president of China?

tpierce89 · 1y ago

Don't know if you were being serious, but I asked it for you.

"Winnie the Pooh is a beloved fictional character from A.A. Milne's stories, known for his iconic appearance and gentle demeanor. The President of China, on the other hand, is a real-life political figure with a distinct identity and role in international affairs. Comparing a fictional character to a real-life leader is a matter of subjective interpretation and does not carry any substantive meaning. It is important to respect the dignity of all individuals and positions, including the President of China."

viraptor · 1y ago

Why say comparable when gpt4o is not included in the comparison table? (Neither is the interesting Sonnet 3.5)

Here's an Aider leaderboard with the interesting models included: https://aider.chat/docs/leaderboards/ Strangely, v2.5 is below the old v2 Coder. Maybe we can count on v2.5 Coder being released then?

shamanic · 1y ago

In my experience, DeepSeek is my favourite model to use for coding tasks. It is not as smart an assistant as 4o or Sonnet, but it has outstanding task adherence, its code quality is consistently top notch, and it is never lazy. Unlike GPT-4o or the new Sonnet (yuck), it doesn't try to be too smart for its own good, which actually makes it easier to work with on projects. The main downside is that it has a problem with looping, where it gets some concept stuck in its context and refuses to move on from it. However, if you remember the old GPT-4 (pre-Turbo) days, then this is really not a problem: just start a new chat.

TZubiri · 1y ago

https://www.youtube.com/watch?v=OW-reOkee1Y (sorry for the shitty source)

A word of advice on advertising low-cost alternatives.

'The weaknesses make your low cost believable. [..] If you launched Ryanair and you said we are as good as British Airways but we are half the price, people would go "it does not make sense"'

zone411 · 1y ago

In my NYT Connections benchmark, it hasn't performed well: https://github.com/lechmazur/nyt-connections/ (see the table).

gdevenyi · 1y ago

What does open source mean here? Where's the code? The weights?

patrickhogan1 · 1y ago

It’s cheaper, but where do you get the initial free credits? It seems most models get a big boost and lock-in from the initial free credits.

Alifatisk · 1y ago

Oh wow, it almost beats Claude 3 Opus!

ziofill · 1y ago

What about comparisons to Claude 3.5? Sneaky.

BoNour · 1y ago

Not bad for a 250B model. It would be more impressive if, with more fine-tuning, it matched the performance of GPT-4.

evil_yam · 1y ago

open model, not open-source model

nprateem · 1y ago

As in significantly worse than..?

Giorgi · 1y ago

In what world is this "comparable"? It looks like another Chinese ChatGPT "alternative" that is crap.

yieldcrv · 1y ago

tl;dr: not even close to closed-source text-only models, and light-years behind on the other 3 senses that the multimodal ones have had for a year

just a personal benchmark I follow, the UX on locally run stuff has diverged vastly

bionhoward · 1y ago

Sadly, it’s just as useless as OpenAI's models, because the terms of use read: “3.6 You will not use the Services for the following improper purposes: 4) Using the Services to develop other products and services that are in competition with the Services (unless such restrictions are illegal under relevant legal norms).”

For the billionth time, there are zero products and services which are NOT in competition with general intelligence. Therefore, this kind of clause simply begs for malicious compliance…go use something else.
