I cut DeepSeek v4 loose on a decent-sized TypeScript codebase and asked it to focus on a single endpoint, go in depth layer by layer (API, DTOs, service, database models), form a complete picture of the types involved and introduced, and ensure no ad hoc types were being introduced.
It produced a brief but very much to-the-point summary of the types being introduced and which of them were redundant, etc.
Then I asked it to simplify it all.
It obviously went through lots of files in both prompts but total cost? Just $0.09 for the Pro version.
On Claude Opus I think (from past experience before price hikes) these two prompts alone would have burned somewhere between $9 to $13 easily with not much benefit.
Note - I didn't use OpenRouter; I used the DeepSeek API directly, because OpenRouter itself was being rate limited by DeepSeek.
Microsoft just announced the availability of OpenAI GPT-5.5, which they're charging 30x for. In contrast, they charge 7.5x for Claude Opus 4.6 and 1x for OpenAI GPT-5.4.
Check out the token-based pricing, and compare GPT-5.5 with all other models.
https://docs.github.com/en/copilot/reference/copilot-billing...
When people say that LLMs aren't worth it, it kills me.
A lot of us, on average, make $100+ an hour. $0.09 is < 4 seconds of our time.
You can't even read the vast majority of prompt responses that fast.
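The time-value comparison above is simple arithmetic:

```python
# How many seconds of a $100/hr developer's time $0.09 of API spend equals.
hourly_rate_usd = 100.0
api_cost_usd = 0.09
seconds_equivalent = api_cost_usd / (hourly_rate_usd / 3600)
print(f"{seconds_equivalent:.2f} s")  # 3.24 s
```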
LLMs will continue to get better, though I doubt it will be at previous rates; all indications are that progress is slowing while costs increase disproportionately.
It seems like >50% of devs think LLMs provide less than 0 value. I just do not get it.
Did they use an LLM once 3 years ago and decide it's never going to be worth it? Have they even tried? Or have they only ever tried it on one giant, monolithic proprietary codebase where they're a total expert, decided that an LLM isn't as good as them, and concluded it's "completely worthless"?
They are shockingly unhelpful on my company's codebase.
But that doesn't mean they are flat-out worthless.
I don't get paid for every waking hour of every day. Often I'm using an LLM for something that's uncompensated, so my hourly wage equivalent is irrelevant.
And for times when we might use an LLM for something related to paid work, it's still money out of your paycheck (unless the employer is paying for it; go nuts in that case). And it's not like using the LLM lets you go home early if it saves you time. You just end up doing more work.
I still use them because they're a useful tool sometimes. But I don't pretend it has negligible or no cost. (Not to mention the externalities around electricity use, crazy data center buildout, skyrocketing GPU and RAM prices, etc.)
With not much benefit compared to DeepSeek v4 Pro @ 9 cents (1/100th of the price) or did neither offer any benefit?
Maybe it's because my tasks are usually chunkier, or because I can't code myself, that I struggle using cheaper models. It feels like at every stage of this process the SOTA model improves things by 5%, which adds up.
But maybe I'm ignorant of the Opus level. My main driver is 5.5, and Opus is there for frontend and second opinions. In the past I also used Claude models for the chatting phase, but 5.5 took over recently. Maybe DeepSeek is closer to Opus and I just overestimated the model compared to 5.5? I tried to give it the benefit of being similar.
Recently I started experimenting with DeepSeek Flash, hoping that if the plan is solid enough it can implement it quickly and cheaply, but for now it doesn't feel worth it.
How do you use the model to see the benefits? Have you tried 5.5 and can you compare to that one as well?
Thanks.
I have a gut feeling that these models can do just as well. Has someone run a reasonably sized task (>=1-2 days of designing and planning) and seen it work well with these models?
* For me, what worked well was the "grill me" skill (or a variation of it) at the design stage. The hygiene I followed here: have it ask one question at a time, resolve dependencies at the design stage, and read the hashed-out plan closely. I also use a couple of MCP tools for grounding, like a documentation server (deepwiki) and arxiv. Other tricks I use are having high-signal tests, and having Claude either read logs and code at the same time or be embedded in the execution (e.g. as a debugger, REPL, or devtools).
So, the experience: at the beginning DeepSeek was amazing. When it started to get expensive (China daytime), I switched from Pro to Flash. No problem, same results. One bitfield implementation was too complicated, so I had to wait for Sonnet 4.6 tokens; kimi-2.6 did the rest. For the very hard problems I asked gpt-5.5, but that was only for one problem. MiniMax was horrible: it didn't follow rules and made lots of silly mistakes.
But when the DeepSeek context window got filled, DeepSeek also started to become stupid. So: either /clear, or /export and strip the file, and start a new session with the cleaned-up sessions. Kimi was overall better, but I keep running into limits with my cheap, moderate subscription. I pay for it privately, as my company's token budget is usually gone after a week of work.
All in all it is worth it. My next compilers (perl 5+6=11) will be done with deepseek and kimi also.
Regarding decompilation: recently we had to decompile the firmware for a UPS we bought that doesn't work on a new system; it only worked on a Raspberry Pi. So I decompiled it with Ghidra and told my colleague: easy, that's how you do it. But my colleague didn't know about token budgets yet and had already thrown Opus at it, on a Copilot Business account. He had working C files immediately, compilable for our new system. It turned out the UPS wasn't beefy enough, but Opus was fantastic. The code was very short and simple C, though.
DeepSeek v4 Pro - 94%
DeepSeek v4 Flash - 96%
https://artificialanalysis.ai/evaluations/omniscience?models...
All the talk about frontier and SOTA is to dig deeper and deeper into the pockets of VCs and finally do an IPO.
I don't understand why we would turn the models into law enforcement officers. Things that are illegal are still illegal and we have professionals to deal with crimes. I don't need Google to be the arbiter of truth and justice. It's already bad enough trying to get accountability from law enforcement and they work for us.
I don't understand why everything changes as soon as an LLM is involved. An LLM is just software.
Sad to see. Because China doesn't give a fuck about liability, this is a structural disadvantage.
the labs don't feel very protected by government, meanwhile the chinese government is yet again fostering protectionism
american industry keeps getting fucked by dubious lawmakers
This is quite a naive take, though. The direction of travel is more fascism in Western governments, where duties of traditional policing are taken over by big corporations whilst police forces are being gutted and made impotent.
It's a simple corporate risk minimization strategy. Just look at how universally despised Grok is on HN. Not because it's a bad model, but because it has less aggressive alignment which means it can be coaxed into saying things that get Xai pilloried here and elsewhere.
This is kind of terrifying to me, regularly. No real manner of recourse for normal people without a following, potential exclusion from real fundamental tooling. Imagine OpenAI goes on to buy 20 companies and now you can't use Figma, Next, whatever, just because you once tripped some very foggy line somehow. Not just OpenAI but the entire ecosystem is so... hard to read.
I was asking Gemini about a quote from Catch-22 and it kept dying mid-stream saying it can't talk about it, God knows why; it had no violent or sexual content, though there is some in the book. I could imagine it dinging my whole Workspace account just because... shrug?
I know ideally the future is local, but I don't know how real that is for most people, at least in the next few years, with practical costs and power usage, except I guess through an M-series processor if you're in that ecosystem.
Funny that your case is Kurt Vonnegut. I think I had Claude refuse a task where I was doing an OCR scan of a book review (in a zine / journal a family member published years ago). I think the review might have included a Vonnegut quote as well, and that I ultimately figured it out it was the quote that was making Claude refuse. I may be misremembering the author though.
Mistral had no such refusals, but their OCR is lesser quality.
While running them locally doesn't presently make sense economically, you don't need to run them locally to address this issue. There is a lot of competition in hosting open models, and you have a variety of services to choose from. Run the open models now; reward that ecosystem instead of continuing to reward closed systems that dream of rent-seeking.
Imagine your livelihood depending on access to LLMs, and then OpenAI bans you with no recourse. This is where AI legislation should be focusing right now, IMO. We can ensure a level of fairness for everyone without putting the brakes on.
Don't worry, you can just make your own Figma, Next, whatever if you have some thousand dollars worth of tokens. This is at least what all of the AI thought leaders have been telling me for the past couple of years.
I did get a refusal when trying to read in-game currency, even though modifying it would do nothing. It has some strange boundaries.
On my personal test bench, when compared to other inexpensive models, GLM 5.1 provides the answers that I would consider most complete or satisfying (these are subjects that I consider myself an expert in). The answers tend to be more comprehensive, nuanced, and include references that I would consider the correct ones (if given access to web search).
I also find it a joy to code with, somewhere between Sonnet 4.6 and Opus 4.6 (have not tested Opus 4.7 yet).
Finally, just gauging by pelicans, it kind of sticks out: https://simonwillison.net/tags/pelican-riding-a-bicycle/
There is one important difference, which is that Claude and Codex will both refuse if I ask them to touch anything related to security. But so long as I’m just studying algorithms and things like that, they’re totally fine with it.
That said, Codex especially will sometimes randomly give me a cybersecurity warning and stop responding. It’s random but happens maybe 2-3 times per day if I’m doing heavy reverse engineering work. Claude is much less fussy unless, once again, you’re explicitly trying to touch anything related to licenses, passwords, etc.
This idea of software threatening the user with consequences is totally wild and dystopian. Fellow developers, what kind of world have we built? This is insanity. Imagine if my hammer told me, "Hey, you shouldn't use me on screws--only nails. Do it again and I'll self-destruct!" WTF people, stop making this kind of software!
This idea of software built on top of reverse-engineered data threatening the user with consequences is what's really even wild and dystopian.
In fact probably every single piece of commercial software you use had you sign a contract saying you wouldn’t do it
But they don't threaten their users or have an "N strikes and you're out" policy. I take those safety caps off of all the chemicals in my garage because I'm a grown-ass adult and those caps are a pain in the butt. I would not expect the manufacturer of a solvent to show up at my house lecturing me about safety and threatening to ban me from buying his products.
You can still use an IDE (hammer) to reverse engineer anything you want.
I was using GPT 5.5 through Cursor recently, and it found what it thought to be a security-related issue. I read the code, didn't see what it was seeing, and said "Run the chain of operations against my local server and provide proof of the exploit."
It thought for a few seconds, then I got a message in the chat window UI saying OpenAI flagged the request as unsafe, and suggested I use a "safer prompt."
Definitely soured me on the model. Whatever guardrails they are putting are too hamfisted and stupid.
that link 404s
For enterprises: https://openai.com/form/enterprise-trusted-access-for-cyber/
Announcements:
Introducing Trusted Access for Cyber, https://openai.com/index/trusted-access-for-cyber/ (Feb 2026)
Trusted access for the next era of cyber defense, https://openai.com/index/scaling-trusted-access-for-cyber-de... (Apr 2026)
"A Dark-Money Campaign Is Paying Influencers to Frame Chinese AI as a Threat" - https://www.wired.com/story/super-pac-backed-by-openai-and-p...
Eventually, access to Chinese models may be illegal in the US. I tell every developer I work with, download them as fast as possible. You never know when this administration could cut off access.
The main difference here is not that DeepSeek's model is completely free of censorship (although I'd wager it's less censored), but that it's open-weight. That has two major advantages:
1) If Anthropic/OpenAI/Google bans you - you're screwed, you can't access their model at all, but if DeepSeek bans - you just go to another provider, or host the model yourself.
2) If the model refuses to answer you can uncensor it (and this is getting easier and more automated day-by-day[1]).
"The photograph you're referring to is the iconic "Tank Man" image, taken during the Tiananmen Square protests in Beijing, China, on June 5, 1989.
The photo, captured by Associated Press photographer Jeff Widener, shows an unidentified protester standing defiantly in front of a column of Chinese Type 59 tanks as they moved through Chang'an Avenue near Tiananmen Square, in the aftermath of the Chinese government's violent crackdown on the pro-democracy demonstrations.
The lone man, dressed in a white shirt and carrying what appears to be a shopping bag, repeatedly blocked the lead tank's path — even as the tank swerved to avoid him. The image became one of the most powerful and enduring symbols of peaceful resistance against oppression in modern history. The identity of the "Tank Man" remains officially unknown to this day."
I run into Claude being a stubborn idiot about far more useful stuff all the time. And often all it takes to bypass is starting a new chat and reframing it, so it's entirely pointless hand wringing.
Then let's not forget only one of these is a paid product, and it's not the more annoying one. I feel like I can forgive DeepSeek for just obeying the laws of the country they're based in, as silly as those might be, because they're being pretty generous with the weights in the first place.
I see 6 alternative providers listed on Openrouter for DeepSeek V4 Pro for example.
I’d rather use the phone home version (deepseeks own endpoint). The benefit is that I’m fairly certain that they actually host the model I’m paying for.
Let us know what your real complaint is, and let's not feign indignation at open models and research.
User publishes to GitHub => DeepSeek trains on GitHub data => DeepSeek gives the model away for free => the user did not work for DeepSeek (in the sense of giving their labour for DeepSeek to make money).
You can use zero data retention and zero training providers for most open weights. See OpenRouter and OpenCode Go/Zen for examples.
This is actually one of the big selling points behind open weights - neither China nor the US get your data.
Seems ok for MIT like licensed code though
We're on the verge of a golden age of software as soon as someone finds a court with courage.
But a court may differ in the future.
This cute policy of mine won't affect anything, though. The more we use the models, the more they will replace this kind of work. Centralisation of power is inevitable: in medieval Europe we had state and church ruling; in modern times, before the internet, it was probably state and banks. Maybe with ongoing digitization (bank branches disappearing) making banks cheaper to operate, combined with bank bailouts, governments will fully nationalize banks, or at least banks will consolidate.
Then the AI companies will consolidate with the internet information and communication companies (Google/Meta for the US, and Alibaba/Tencent for China). Maybe we'll end up with a few de-facto governmental megacorps that rule in tandem and close cooperation with the formal government, who might handle mostly infra, utilities and the army. The megacorp would control narrative more and take more of a paternal role (educating and protecting the citizens, normally handled by formal governments).
Does this make sense?
And unfortunately AWS doesn't have prepaid billing, so you can't just give the internet access to your API key without getting FinDDoS'd.
There's some use cases I won't use a hosted model for, and will only do self hosted.
Otherwise, if they're going to keep releasing open-weight models, I'm going to keep giving them data.
Do you really think OpenAI, Anthropic or any other entity in the same business respects your data?
The Chinese AI companies who release open weights actually deserve whatever input you give them. They are the reason why there is competition and not duopolies in the domain.
OpenAI, I wouldn't be surprised if you were right.
But the more important one is the social contract. GitHub came long before the LLM era. Its branding is about being the home of open source projects, and many users want it to stay away from the AI hype. You wouldn't expect LLM providers to stay away from AI hype (duh), so it's less of an issue for them.
"Can you tell me who was on series 8 of Taskmaster, and what's the general opinion about the series? No spoilers!"
It told me amongst other things that Paul Sinha was diagnosed with Parkinsons, as well as who the winner was.
Then I said, "But I said no spoilers!"
And it apologised for telling me Paul Sinha was diagnosed with Parkinsons.
Did you enable reasoning ("DeepThink")? LLMs usually can not reason about what they are going to write before they do. There is that famous experiment where an LLM is prompted to say whether the birth year of a famous person is even or odd. If the LLM is constrained to only answer with "even" or "odd", the accuracy is around 50%, i.e. no better than random chance, but if the LLM is allowed to first answer with the birth year of the famous person followed by whether the year is even or odd, it is able to "see" what the year is, and answers correctly almost every time.
In your case, the LLM might be able to recognize the spoiler during its reasoning phase and omit it.
Another explanation might be that the LLM interpreted the "No spoilers!" as "Do not spoil the tasks of the show" instead of "Do not spoil the winner".
Lastly, the question "Can you tell me...?" is not a good fit for LLMs since they are notoriously bad at knowing what they know. You can leave it out to save a few characters.
For DS4 Pro there's a discount going on for the official API, which sometimes gets overlooked and mixed up in discussions. Simon uses the full price in the comparison, so that's not an issue here.
The other issue is that DS4 Pro and K2.6 often use way more reasoning tokens than the frontier models. In my testing there are certain pathological cases where a request can cost the same as with a frontier model because they use so much more tokens. To be fair I'm using DS and kimi via 3rd party providers, so they might have issues with their setups.
But if you look at the Artificial Analysis pages of the models you'll see that DSv4 Pro uses 190M tokens and K2.6 170M tokens for their intelligence benchmark, while GPT 5.5 (high) only used 45M.[0][1][2]
I recommend looking at the "Intelligence vs. Cost to Run Artificial Analysis Intelligence Index" ("Intelligence vs Cost" in the UI). The open source models are still cheaper to run, but not by as much as you'd think just looking at the token prices.
[0] https://artificialanalysis.ai/models/deepseek-v4-pro [1] https://artificialanalysis.ai/models/kimi-k2-6 [2] https://artificialanalysis.ai/models/gpt-5-5-high
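To see why token efficiency matters, multiply tokens used by an assumed price. The prices below are placeholders for illustration, not published rates:

```python
# Placeholder prices for illustration only -- not real rate cards.
runs = {
    # model: (tokens used on the benchmark, assumed USD per 1M tokens)
    "DeepSeek V4 Pro": (190e6, 0.40),
    "Kimi K2.6":       (170e6, 0.50),
    "GPT 5.5 (high)":  (45e6, 10.00),
}
for model, (tokens, price_per_mtok) in runs.items():
    cost = tokens / 1e6 * price_per_mtok
    print(f"{model}: ${cost:,.0f}")
```

Even with a 25x cheaper per-token price, the open model's 4x higher token usage eats a good chunk of the gap.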
They introduce very novel methods to improve long-context efficiency and attention: HCA & mCH. It requires only 27% of the FLOPs for inference and 10% of the KV cache compared to v3.2, which makes it super efficient. Think about it: for FLOPs, we can now serve more than 3x the volume with the same amount of compute, and you'd need 30% of the prior KV cache.
Furthermore, this release is a PREVIEW. DeepSeek is the real open lab: they not only cook up quite a bit with every single release, they publish and share it. I'm running this locally.
Let me tell you how "CHEAP" this is. With v3.2 I would run out of GPU RAM and spill into system RAM at 256k context. It ran quite alright, and I was happy with my 7 tk/sec. With this, I'm 100% in GPU RAM at the full 1 million token context, running more than 2x as fast while getting better results.
This is super cheap. Moonshot has made it clear they are starved for GPUs, and that's why. If they had GPU capacity like we do in the US and subsidized the models like we do here, they would be giving it away for free!
Impressive! What is your setup? Are you running the full DeepSeek V4 Pro, or V4 Flash?
I had attempted this with Opus 4.6 in the past and it burned through the $10 budget I’d given it before it returned from my initial prompt.
Even if it’s heavily discounted, it would still have cost me single digits for a complete solution vs double-digits for exactly nothing.
I didn't want to say that they're not cheaper to run, artificial analysis also shows that they're cheaper. My main point was about it being important to also look at token efficiency, not only cost per token, to get the full picture.
I use Agent Hive [0] for more complex tasks. It sends off subagents with models and parameters I can configure for each agent (e.g. a low-temp coder, a higher-temp one with some top_k / top_p for research and architecture, etc.).
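Purely as an illustration (this is not Agent Hive's actual configuration format; the model names and parameter values are made up), per-agent sampling parameters might be kept in a structure like:

```python
# Hypothetical config shape -- not any tool's real format; values made up.
subagents = {
    "coder":     {"model": "deepseek-v4-pro", "temperature": 0.1},
    "architect": {"model": "deepseek-v4-pro", "temperature": 0.8,
                  "top_p": 0.95, "top_k": 40},
}

def params_for(role):
    """Sampling parameters used when spawning a subagent for this role."""
    return subagents[role]

print(params_for("coder"))
```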
To be clear, I'm not doing state-of-the-art stuff. I mostly used it for frontend development, since I'm not great at that and just need a decent-looking prototype.
But for my purposes it's a perfectly good model, and the price is decent.
I can't wait for an open model small enough for me to run locally to come out, though. I hate having to rely on someone else's machines (and getting all my data exfiltrated that way).
Disclaimer I'm the cofounder. This works by running the model inside a secure enclave (using NVIDIA confidential computing) and verifying the open source code running inside the enclave matches the runtime attestation. The docs walk you through the verification process: https://docs.tinfoil.sh/verification/verification-in-tinfoil
It would still ultimately exfiltrate the data outside of my control, and frankly I don't trust any "secure enclave" tech.
As far as I'm concerned, physical access is root access, and for any private stuff that is wholly unacceptable.
Which provider are you using for inference? Opencode or the DeepSeek api?
For me, this is a real alternative after I cancel my github copilot towards the end of the month..
* As you’ve noted, people keep finding ways of slamming more intelligence into smaller models, meaning that a given hardware spec delivers more model capability over time.
* Hardware will continue to improve and supply will catch up to demand, meaning that a dollar will deliver more hardware spec over time.
I hope that one day we’ll look back on the current model of “accessing AI through provider APIs” the same way we now look back on “everyone connecting to the company mainframe.”
So much of what I ask codex to do doesn’t require full GPT 5 intelligence, and if 75% of the tokens were generated locally that’d save a massive amount of cost.
High end SOTA coding is harder, but even there I suspect a mix of usage based strong models and selfhost small is viable if necessary.
Of course, this is fine for people in the bay area earning hundreds of thousands of dollars a year. But then your client base becomes so reduced its hard to justify the valuation these companies have.
These AI companies are not hyped so much because they will offer a luxury product, they're valued because they're supposed to "change the world" which luxury does not do.
Two caveats:
- when inferring through OpenRouter, we've had a lot of issues with very slow speeds (TPS) and occasional instability. I just checked, and it's still 10-30 TPS on all available providers, which is not a lot for a model that likes to think as much as DeepSeek does.
- the official DeepSeek API makes no guarantees of data privacy even for paying users.
Both points could be moot with using it through Azure AI foundry (the latter is, afaik); I have yet to test that.
In any case, happy to see more open-weights models that are somewhat competitive with SOTA models!
Those tokens are heavily subsidized, but DeepSeek's API pricing is looking really good. For example, with an agentic coding setup (roughly 85% input, 15% output and around 90% cache reads) I'd get around 150M tokens per month for the same 100 USD. Even at more output tokens and worse cache performance, it'd still most likely be upwards of 100M.
The 150M assumption of mine is for 100 USD at the regular prices (though even that needs sufficient cache hits). Anthropic subsidizes way more per-token I think, though.
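The blended-price arithmetic behind estimates like this can be sketched as follows. The per-million-token prices are placeholders, not DeepSeek's actual rate card, so the result won't match the 150M figure exactly:

```python
# Placeholder per-1M-token prices (USD) -- substitute the real rate card.
PRICE_IN_MISS = 0.60   # input tokens, cache miss
PRICE_IN_HIT  = 0.06   # input tokens, cache hit
PRICE_OUT     = 2.40   # output tokens

def tokens_per_budget(budget_usd, in_share=0.85, cache_hit_rate=0.90):
    """Total tokens a budget buys under an assumed input/output/cache mix."""
    blended_per_token = (
        in_share * (cache_hit_rate * PRICE_IN_HIT
                    + (1 - cache_hit_rate) * PRICE_IN_MISS)
        + (1 - in_share) * PRICE_OUT
    ) / 1e6
    return budget_usd / blended_per_token

print(f"{tokens_per_budget(100) / 1e6:.0f}M tokens")  # ~219M with these prices
```

Worse cache performance or a heavier output share drops the total quickly, which is why the cache-hit assumption matters so much.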
We had to really understand why it outperformed DeepSeek V4 Pro (although even on unreliable model cards, Flash was very close to Pro). Pro is slower and smarter in one-shot reasoning problems, but less effective with tools and therefore less performant in long horizon agentic tasks (especially with custom tools it was not trained on).
Benchmarks at https://gertlabs.com/rankings
I'm gonna stick to GLM5.1 for now.
e.g. Have V4 call out to Opus when it's uncertain, but otherwise handle execution.
The results with Sonnet/Haiku in the blog post seemed promising, so I'm curious how it would go with these latest open models.
(3) The deepseek-v4-pro model is currently offered at a 75% discount, extended until 2026/05/31 15:59 UTC.
Was this taken into account when reviewing the model?
DeepSeek pro is 65/86% cheaper (i/o tokens) in subsidized pro vs pro and 91/97% cheaper with current subsidies.
Flash vs Sonnet 4.6 is 95/98%
We know DS runs profitable, they also indicate in their paper they expect prices to drop as they get access to the next gen Huawei cards.
Keep the pelican but isn’t it time to add something else more novel that all current and past models struggle with?
Don't understand why this test gets any attention, I mean other than the pelicans which isn't a good test, theres no meat in this article.
Even without the currently discounted pricing, the value is incredible.
It takes about twice as long to finish code reviews given an identical context compared to Opus 4.7/GPT 5.5, but at 1/10 the cost or less, there's just no comparison.
GPT-5 Nano should really be in the list too. It is $0.05 input and $0.40 output - and half that if you use the Flex tier.
Last week I upgraded an old batch process from GPT-4.1 Nano, and GPT-5 Nano worked just as well as GPT-5.4 Nano but at a much lower cost.
As always, OpenAI's naming is really bad; GPT-5.4 Nano is a different model, it's not a straight upgrade from GPT-5 Nano.
ChatGPT has really degraded in my eyes, and I find Grok and Deepseek more helpful most of the time.
Of course, ChatGPT is better sometimes.
These models are just better than others at different cases, thus the reason to experiment.
I tried to build something simple and while it got the job done the thinking displayed did not fill me with confidence. It was pages and pages of "actually no", "hang on", "wait that makes no sense". It was like the model was having a breakdown.
Bear in mind OpenCode was also new to me, so I could just be seeing thinking where I usually don't.
Claude does the same thing, claude code just hides the thinking now
3rd party models are a drop-in replacement with `ANTHROPIC_BASE_URL` in Claude Code, something people seem to miss right now. And contrary to what Anthropic might like to have you think, you don't need Opus 4.7 to run the harness to get similar performance.
https://api-docs.deepseek.com/quick_start/agent_integrations...
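The drop-in swap above looks roughly like this. The endpoint path and model name follow DeepSeek's Anthropic-compatibility docs linked above, but verify them against the current docs before relying on this:

```shell
# Point Claude Code at a third-party Anthropic-compatible endpoint.
export ANTHROPIC_BASE_URL="https://api.deepseek.com/anthropic"
export ANTHROPIC_AUTH_TOKEN="sk-..."     # your DeepSeek API key
export ANTHROPIC_MODEL="deepseek-chat"   # model served behind that endpoint
# claude                                 # then launch Claude Code as usual
echo "$ANTHROPIC_BASE_URL"
```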
It has probably been trained to assess its own "thoughts" regularly and outputs those assessment results. I wouldn't worry much about the contents of the reasoning text, and it's nice to have it, in contrast to the closed models' "summaries", since it's easier to see what's going on.
I had to turn off thinking traces because it was just giving me anxiety looking at it.
Well there's your problem.
Edit: I remember seeing similar things with ChatGPT or Codex, although I can't remember in which context.
In my tests[0], V4 Flash actually does slightly better and for a lot cheaper than V4 Pro, mostly because it reasons twice as much.
[0]: https://aibenchy.com/compare/deepseek-deepseek-v4-flash-high...
GLM 5.1 for me was a bit of a llama3.1 moment (the first open model I could chat with that was usable in managing my inputs the intended way) for code: the first open model that was actually usable.
Are frontier models capable of building something only with general directions now?
I think this probably depends quite a bit on the specific problem. I'm finding that Deepseek v4 Flash often outdoes Kimi 2.6 on a variety of coding problems that involve complex spatial reasoning
I've been hearing amazing things about Flash, I should give it a try.
I've used K2.6, GLM5.1, and DSV4 all a good amount. They're all very impressive, but DSV4 has taken the cake.
1. Web platform: asked it to analyse a feature to create reports and come up with a better solution and better UX. It did great; I would say on par with Sonnet 4.6 or even Opus, considering the thinking and explanation.
2. Mac app with some basic functionality: it did well from a functional perspective, but then I used Opus 4.7 to evaluate and suggest improvements, and noticed it had missed many vital points in the design system and usability.
I think it’s a leap, I haven’t used a model this capable that is not OpenAI or Anthropic
OpenAI has GPT-5.5 Pro, whose only difference, I think, is the price. Billing is from OpenRouter, but the breakdown is roughly:
- GPT 5.5 Pro: super expensive, it makes no sense (cost is around $2)
- Gemini/Opus: $0.2/$0.1. Opus is cheaper as it consumed fewer tokens
- DeepSeek/GLM: $0.019/$0.021, 5-10 times cheaper than Gemini and Opus
The example Simon generated just shows that larger models don't necessarily produce better results.
If you take DeepSeek's numbers for DeepSeek-V3 (https://github.com/deepseek-ai/open-infra-index/blob/main/20...) and plug in ~3333 tps/GPU for DeepSeek-V4-Pro (https://developer.nvidia.com/blog/build-with-deepseek-v4-usi...) and a price of $7/hr per B300 GPU, the profit comes out at 202%.
The rumor is that Anthropic's Opus models have ~100B active parameters, which is twice as much as DeepSeek-V4-Pro, so inference is at least twice as expensive. Since the API pricing is almost 30 times that of DeepSeek, Anthropic's margins are likely very healthy. But they have to be, since Anthropic has to offset the model training costs, while DeepSeek is backed by High-Flyer Quant. DeepSeek might still be profitable anyway, but without knowing how much they spent on training and wages, we can't really tell.
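The back-of-envelope margin check above can be sketched in a few lines. The per-token price here is my own guess, chosen so that it reproduces the ~202% figure; substitute the real rate card before drawing conclusions:

```python
# Margin estimate from throughput, GPU rental cost, and an assumed price.
tps_per_gpu = 3333            # decode tokens/sec per GPU (NVIDIA blog figure)
gpu_cost_per_hr = 7.00        # USD, rented B300
price_per_mtok = 1.76         # USD per 1M tokens -- an assumed blended price

tokens_per_hr = tps_per_gpu * 3600
revenue_per_hr = tokens_per_hr / 1e6 * price_per_mtok
profit_pct = (revenue_per_hr - gpu_cost_per_hr) / gpu_cost_per_hr * 100
print(f"{profit_pct:.0f}%")   # 202%
```

Note this covers inference only; training costs and wages sit outside the calculation, as the comment above says.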
Why are you asking?
Mind you, it's an absolutely sensible setup either way if you're just testing a few queries and are willing to run them unattended/overnight. Especially since the KV cache size is apparently really low (~10GB is said to be typical), so you get a lot of batching potential even on consumer setups, which amortizes the cost of fetching weights.
Let's book 8/16 cores/threads to run a prompt.
What are the timing figures I am looking at to run an "average" coding prompt?
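As a rough first-order answer to the timing question: single-stream decode is memory-bandwidth bound, so tokens/sec is approximately bandwidth divided by the bytes of active weights read per token (prefill is compute-bound and usually much faster, so it's ignored here). Every number below is an assumption for illustration, not a measurement:

```python
# All numbers are illustrative assumptions, not measurements.
bandwidth_gb_s   = 200.0   # effective host memory bandwidth
active_params_b  = 30.0    # active parameters per token (MoE), billions
bytes_per_param  = 0.5     # 4-bit quantization

bytes_per_token = active_params_b * 1e9 * bytes_per_param
tok_per_s = bandwidth_gb_s * 1e9 / bytes_per_token  # single-stream decode

output_tokens = 2_000      # reply length for an "average" coding prompt
decode_minutes = output_tokens / tok_per_s / 60
print(f"{tok_per_s:.1f} tok/s, ~{decode_minutes:.1f} min per reply")
```

Batching several prompts multiplies throughput until the setup becomes compute-bound, which is the amortization effect mentioned above.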
DeepSeek is a great model, and Cecli is all about efficiency. It works great for my purposes - agentic programming on a budget.
[1] https://www.reuters.com/world/china/openai-accuses-deepseek-...
It's certain that all the labs use each other's APIs extensively for testing; what's the actual evidence that DeepSeek was at a significantly higher scale, etc.?
It's morally right to fuck over Anthropic (and OpenAI, or any other lab). Works generated by AI are not copyrightable anyways, and their terms of service have zero legal value.