I only hit the 5 hour limit every few days and the weekly limit a day or two before it resets at the most aggressive. I wouldn’t expect my usage to increase dramatically, other than not being stopped by limits.
I’m still apprehensive about shipping all my stuff off to a lab under an adversarial government (to the US), so not just looking at this from a pure cost basis, but my question is from the cost lens at the moment.
It's maybe not quite as knowledgeable as the most expensive American models and maybe makes more mistakes (just a feeling based off of vibes, don't take my word for it), so you need to constrain its scope more. That suits my workflow: half the time I have it generate code in the chat window and then write it myself, and I'm mostly using it at the level of generating function bodies and stuff, not entire features. That said, it is writing a lot of SwiftUI without me really knowing the language, and doing a fine job as far as I can tell (which isn't much, admittedly).
One benefit I don't see talked about is its speed - it's really quick, doesn't spend too much time reasoning even on "max", and the flash model is pretty dang good too. This lets me get into "flow state" when I'm writing code, compared to my experiences with Codex and Opus, which would take minutes to complete even basic tasks and kind of ruined my focus.
It's so cheap though, you could download a different harness (Crush, OpenCode, Pi etc) and load $5 in credits and test it for yourself.
# Point Claude Code at DeepSeek's Anthropic-compatible endpoint
export ANTHROPIC_BASE_URL=https://api.deepseek.com/anthropic
export ANTHROPIC_AUTH_TOKEN="PUT_YOUR_DEEPSEEK_KEY_HERE"  # replace with your DeepSeek API key
export ANTHROPIC_MODEL=deepseek-v4-pro
export ANTHROPIC_DEFAULT_OPUS_MODEL=deepseek-v4-pro
export ANTHROPIC_DEFAULT_SONNET_MODEL=deepseek-v4-pro
export ANTHROPIC_DEFAULT_HAIKU_MODEL=deepseek-v4-flash
export CLAUDE_CODE_SUBAGENT_MODEL=deepseek-v4-flash
export CLAUDE_CODE_EFFORT_LEVEL=max
I started by using it for some bigger reading jobs, particularly when I was near limit. Honestly, it's not quite as good, but it's much cheaper, and means I can carry on working. I also find it's sometimes good to ask Claude and DeepSeek to consider the same code and how to polish it, then see what they both say.
> I’m still apprehensive about shipping all my stuff off to a lab under an adversarial government (to the US)
Do you mean you don't want to use the models created by a non-US lab? In that case, yes you're stuck with US models, but there's a half dozen big labs in the US. If you meant just where your inference is done, there are providers in 12 different countries through OpenRouter, including the US. Several subscription providers host in multiple countries. There's a lot of choices.
As usual, different models get stuck on different things. I run DeepSeek v4 API for most of my Cursor experimentation / poking around / proof of concept stuff, but I trust it less than OpenAI/Claude for writing production code. Sometimes DeepSeek is great for debugging, planning, etc. Sometimes it gets stuck or outputs low quality. That's true of OpenAI and Anthropic models as well though.
Overall, DeepSeek seems serviceable but a rung below Opus 4.8 and GPT 5.5. I run them all on maximum thinking settings.
Repo reference here: https://github.com/aravindhsampath/agentic-template
DeepSeek and Xiaomi's deals on cache reads go with their models' latest gens making caching cheaper (using less space for KVs). No open-model inference provider has decided to match the pricing. I'm sure that says something about how inference pricing works, but not completely sure what.
Agree with others that top open models aren't on the frontier, and I would expect differences doing big-picture planning or anywhere you're only giving broad brushstrokes and looking for a lot to be guessed. But they do seem fine at coding from a concrete plan! No experience in huge codebases because I only use them outside work, but they seem good enough about gathering info before they dive in that I'd expect them to grep around as they need.
An annoying caveat: individual subscription plans, used heavily, are much cheaper than the API -- see https://she-llac.com/claude-limits -- which complicates any argument about cost. I still think open models are worth playing with. They're one of the things that let us treat this as a technology rather than just as the product offerings of one of a few companies.
Hardest stuff i threw at it... i did like a set of 3 each for claude/gpt/ds, it was all pretty steady across all providers. I think claude won but it could have just been it rng'd into the 3 easier tasks, they are all similar tasks but not identical, these aren't like benchmark tasks just a steady flow of annoying html/json/regex type stuff. Almost always they need a second pass regardless of what model i throw at it, just to tighten up some loose ends, and it fit right into what my current expectation was of gpt 5.5 and opus 4.6.
For evals in particular (tuning workflows that agents are using), effectively not having to worry about price is an incredible multiplier - getting a statistically significant signal is not cheap otherwise.
https://artificialanalysis.ai/evaluations/omniscience
Esp check the Hallucination rate for Deepseek - it's not good.
For strongly-typed coding tasks - and I imagine other tasks that have cheap validity checks: agentic harnesses and thinking tokens are an effective foil against hallucinations, at the expense of time. If a model hallucinates an API, compilation will fail and the error is fed back into the machine so it can try again, in a two-steps-forward-one-step-back dance that is unreasonably effective. Given the price delta, it is often more cost effective to let the weaker model spiral towards a solution with many "Oh, wait..." turns.
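A minimal sketch of that feedback loop, in Python for brevity. Everything here is hypothetical: `fake_model` is a stand-in for a real model call, and `check` plays the role of the compiler or test run. Python only catches a hallucinated API when the code runs, so the check executes the candidate rather than just compiling it.

```python
import math
from typing import Callable, Optional, Tuple


def check(source: str) -> Optional[str]:
    """Cheap validity check: run the candidate, return an error message or None."""
    namespace = {"math": math}
    try:
        exec(source, namespace)                    # define the candidate function
        result = namespace["hypotenuse"](3, 4)     # exercise it on a known case
        if result != 5:
            return f"hypotenuse(3, 4) returned {result!r}, expected 5"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def repair_loop(
    generate: Callable[[str, Optional[str]], str], task: str, max_turns: int = 5
) -> Tuple[str, int]:
    """Generate code, validate it, and feed any error back until it passes."""
    error: Optional[str] = None
    for turn in range(1, max_turns + 1):
        source = generate(task, error)
        error = check(source)
        if error is None:
            return source, turn
    raise RuntimeError(f"no valid solution after {max_turns} turns: {error}")


def fake_model(task: str, error: Optional[str]) -> str:
    """Hypothetical model: hallucinates an API first, corrects itself on feedback."""
    if error is None:
        return "def hypotenuse(a, b):\n    return math.squareroot(a * a + b * b)\n"
    return "def hypotenuse(a, b):\n    return math.sqrt(a * a + b * b)\n"


source, turns = repair_loop(fake_model, "write hypotenuse(a, b)")
print(f"accepted after {turns} turns")
```

The extra turn is the "one step back"; whether it is worth it depends on how cheap each turn is and how reliable the check is.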
What is that claim based on?
if it's 99.9% comparable performance for less money I'm interested, but I'm skeptical it's there
The "intelligence" is clearly there now. Trying to measure it seems pointless. I can't shop for hammers at the hardware store and sort by the quality of finished products they would produce. That is clearly an insane ask, but that's approximately what is being pushed for with these models now.
Domain specificity (harness & environment) is where the magic happens next. I intentionally use a slightly less powerful model to help reveal weakness in how I've exposed the domain to the model. Having capability reserves available dramatically increases confidence around a project like this. If the customer starts to complain about some edges, I can crank them up to gpt5.5 for target scenarios. If I'm already on 5.5 there's nowhere else to go. I'm up against the wall.
I wonder if I am using the same models as everyone else. To me, LLMs still give good answers 80% of the time, but the other 20% of the time they fail in such a miserable way that it's obvious the "intelligence" is not there.
But when an LLM does it on an area we know, we notice and suddenly it's too much.
The "works for me" says more about the field of the LLM reviewer than about the LLM.
I still find things to tweak and fix up but the amount dropped pretty dramatically. As always I am responsible for what I ship so I review and test everything of course. I still think we are a ways away from fully automated software forge but what is currently possible is pretty cool.
An auditing/QA step (whether a grading checklist, verification, etc) can get you further. Likewise for a planning step.
That being said the models still surprise me with a broad range of hallucinations, lack of epistemology or common sense or inability to follow instructions on a daily basis.
Today it was trying to get opus 4.8 to just follow a simple architectural pattern for controllers in a rails app. It was pulling teeth out of a shark.
Already the fact that we could have to ask "there where", the fact that we have met clearly unintelligent bots, creates a requirement about defining where it (intelligence) is and investigating what put it there, to get the warranties that intelligence will be met consistently, structurally, and not casually, apparently.
Casual use, casual tool; mission critical use, certified tool.
not really. it happens in training and RL. your harness is not going to override what it has been trained to do.
Sure, a harness is useful if you are trying to build CRUD websites and the model is trained on stamping out CRUD websites. But that's just a waste of time remixing things better.
We are just getting into the nitty-gritty of LLM benchmarking - to be fair, they still have a long way to go IMO. But it's incredibly exciting that a locally run LLM is capable of producing results similar to a SOTA model's.
What? You can and you should. That's exactly what product tests are enabling you to do. If you need a glue, you want to look at someone who tried to glue some things with a few glues, so you know roughly what to expect from which specific glue.
The article reads like thin, auto-generated ai clickbait for nerd sniping or shilling a model.
Consider the lead:
> DeepSeek V4 Pro wins this head-to-head by being more exact where it matters: following instructions, matching schemas, and solving edge cases cleanly. GPT-5.5 Pro is still strong, but it gave away points with avoidable deviations.
“where it matters”, “cleanly”, “is still strong”, and other vague references, instead of just telling us that DeepSeek yielded more concise results in 3 out of 4 tests.
1 star.
Per Merriam-Webster [^1], a lede is:
> the introductory section of a news story that is intended to entice the reader to read the full story
(Emphasis mine)
You may prefer more matter-of-fact phrasing, of course, but criticising a lede for attempting to achieve its goal is unjustified.
So dismissing it on technicalities is for sure clever but also obvious and lame.
The Letter/spirit thing eventually got boring. Please find better material
Filling it with slop constructs signals to the reader that no effort was made in writing the article. So no effort should be put into reading it.
The rest of the article is equally flimsy. Great clickbait title, perhaps that is even harder than writing a lede.
I am not a native speaker :)
I found the writing clear and quite even handed. The lead is a bit salesy, but leads typically are. Knee-jerk dismissals based on vibes that something is LLM generated are quite low-effort.
It shows DeepSeek is competitive with, and sometimes better than, GPT 5.5. It also shows there is no moat. As such it is a highly significant signal.
An X5 is not simply “inferior” to a CR-V, or vice versa. A Camry is not “inferior” to an F-150, or vice versa. They are optimized for different buyers, budgets, constraints, and use cases.
That may actually be the better analogy for AI models: there probably is not one universal “best” model. There are models that are better or worse for particular tasks, price points, latency requirements, deployment constraints, privacy needs, etc.
No one ever says this about the “pelican on a bicycle” metric
GPT 5.5 Pro found two out of four cases that it got to before blowing its budget. Maybe it would have been the best of the bunch with infinite budget, but Opus 4.8, DeepSeek V4 Pro, and MiMo 2.5 Pro found four of nine of the bugs. Opus was an order of magnitude cheaper than GPT 5.5 Pro (and something like 30% cheaper than GPT 5.5), DeepSeek and MiMo were two orders of magnitude cheaper at roughly a dime per case.
GPT Pro also chews a lot and a long time, relatively speaking.
I can't come up with a use case where I can rationally spend ~31 times what Opus costs to use GPT 5.5 Pro, and I won't be doing any more benchmarking with it.
Given how much token costs are becoming an issue people talk about, the fact that there are models that cost dramatically less than those of the big American providers is going to be an issue for Anthropic and OpenAI. I'm happy to pay a premium (within reason) for the best model for interactive coding, but for API use, where having the model repeat itself, compare against other models, have models judge other models' work, etc. is not time-consuming for a human and is just a matter of implementing the harnesses and framework for proving correctness, I can't come up with a reason to spend ten or two hundred times as much as DeepSeek.
> With $3.88 & 690,003,591 tokens and 5 hours, Deepseek Pro & Flash combined, managed to reverse engineer Teamspeak's Licensing System for 3.13.8 (latest of post)
https://www.reddit.com/r/DeepSeek/comments/1txcfrh/with_388_...
This is some of the funniest stuff I've read in a while
9 bugs is probably a bit low of a sample size to get a ranking.
That being said the ranking does end up roughly how you'd expect.
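To put a rough number on the sample-size worry, here is a small sketch that computes a Wilson score interval for the two hit rates quoted upthread (four of nine bugs, and two of the four cases GPT 5.5 Pro got to). The choice of the Wilson interval and the 95% level (z = 1.96) are my assumptions, not something the benchmark author used.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


four_of_nine = wilson_interval(4, 9)  # Opus 4.8, DeepSeek V4 Pro, MiMo 2.5 Pro
two_of_four = wilson_interval(2, 4)   # GPT 5.5 Pro, before it blew its budget

for label, (low, high) in [("4/9", four_of_nine), ("2/4", two_of_four)]:
    print(f"{label}: plausible hit rate from {low:.2f} to {high:.2f}")
```

Both intervals are wide and overlap heavily, so with this few cases the hit rates alone can't separate the models; the cost differences are a different matter.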
Deepseek is Pro, right? Not Flash? I've been using Flash for a lot of smaller tasks and finding it reasonably good. It's good for "interactive" use. Very fast, does small tasks nearly instantly.
It's also decent for investigating large codebases. I wonder if it could do security work too.
DeepSeek was actually the `deepseek-chat` alias in the API (which dynamically chooses the model based on info I don't know), but when I checked the usage, it was all DeepSeek V4 Pro for the benchmark. I later changed DeepSeek to explicitly use Pro for subsequent experiments, so future runs will be explicitly Pro.
I probably will do a test of smaller models, exclusively, at some point. But, I figured DeepSeek V4 Pro is so cheap, especially given their caching effectiveness and cached input pricing, for my own use I'll probably just use DeepSeek V4 Pro when I need a cheap, fast, near-frontier model.
And nice to see the cheap models doing so well.
I don't know whether models are over fitted to benchmarks and people take them at face value, but I spend less on DS4 apis than I do for Claude Code 100$ subscription and I code everyday. So far I'm quite happy with the results.
The remaining 5% of time you get a big boost for your high-reasoning problem solving needs and evade a lot of pain. Now, I just need to be able to predict accurately when I need this extra 5% and when not :)
I don't feel like paying 100 times the price for a 1-5% better tool.
However, that's probably not how most professional developers use LLMs. I tend to give well-specified, more constrained tasks, and for those, I find that Opus performs worse than other models precisely because it tends to infer unstated requirements and do things I didn't want it to do. In this situation, GPT 5.5 works better for me because it only and precisely does what I ask it to.
It worked for me too, for months, when I was working on trivial web projects.
Around February of this year it got lobotomized and I quit my subscription end of march.
I am not going back.
> We ran 4 fresh text tasks, generated on the fly for this matchup so neither model could prepare in advance, and had grok-4-1-fast-non-reasoning score each one. DeepSeek: DeepSeek V4 Pro scored 38.0 to OpenAI: GPT-5.5 Pro's 33.0.
Requests to grok-4-1-fast-non-reasoning now silently route to grok-4.3 (a 5x more expensive model), with reasoning set to "none".
https://docs.x.ai/developers/migration/may-15-retirement
TFA was published today, which implies grok-4.3 was used.
Hopefully this dynamic continues long enough to make local/private inference the leading solution for coding.
As for other segments, high API pricing gets people to switch to the subscriptions instead which is stickier than the API.
So it doesn't surprise me at all that the methodology is weak, too.
1. MoE (nothing new here, but, this helps a lot)
2. Compressed Attention Mechanisms (this is their core innovation) - this dramatically reduces the Key-Value (KV) cache requirements for longer contexts
Another thing that helps is significantly lower energy costs in China.
Another point, from my own guess: they are running some percentage of the inference on their own home-grown AI inference chips.
Every little improvement would save them billions, so it's hard to imagine they aren't pouring a lot of resources into that already.
My guess is that they do aggressive caching / some proprietary optimizations in their hosting setup that they haven't published. Maybe they are also running at a loss to gain market share.
And judging from latency / network performance, I don't think what you access, when you access deepseek.com from Europe, is hosted in China.
We've been using it for async "heartbeat" processing and sms replies, but it's just too slow for live chat replies (which is a shame, as I'd really love to use it there).
Very capable model, but also very slow.
At the moment of writing https://news.ycombinator.com/item?id=48343690 MiMo V2.5 Pro had a lower cache hit ratio. From the article:
OSS models, depending on who you use them from, make a huge difference, mostly due to cache-hit rates.
Model Cheapest effectiveInputPrice (Provider)
MiMo-V2.5-Pro 0.3720 (Xiaomi)
DeepSeek V4 Pro (Max) 0.0560 (DeepSeek)
EDIT: okay, I misread it. Does this mean that DeepSeek reuses a higher percentage of tokens at the cache price than MiMo? Am I right?
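One way to read the effectiveInputPrice column: if it is a blend of the cached and uncached input prices weighted by the cache hit rate (an assumption on my part; the article's exact definition isn't quoted here), then a higher hit rate pulls the effective price toward the cache price. A minimal sketch with made-up prices, not DeepSeek's or Xiaomi's actual rates:

```python
def effective_input_price(uncached_price: float, cached_price: float, hit_rate: float) -> float:
    """Blend of uncached and cached input prices, weighted by the cache hit rate.

    Both prices use the same unit (e.g. USD per million input tokens);
    hit_rate is the fraction of input tokens billed at the cached price.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cached_price + (1.0 - hit_rate) * uncached_price


# Hypothetical list prices, kept identical to isolate the effect of the hit rate.
UNCACHED, CACHED = 1.00, 0.10

low_hit = effective_input_price(UNCACHED, CACHED, hit_rate=0.50)
high_hit = effective_input_price(UNCACHED, CACHED, hit_rate=0.90)
print(f"50% hits: {low_hit:.2f}, 90% hits: {high_hit:.2f}")
```

With identical list prices, the provider with the higher hit rate ends up with the lower effective price, which matches the reading in the EDIT. In practice the list prices differ too, so the hit rate is only part of the gap.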
An AI-generated article about a single-run AI test, which in theory had many components, where the AI judge declared DeepSeek "won"?
How many runs were there on each test to account for some temperature variance? Only one.
Did deepseek write better code? Did GPT's code have bugs when doing the regex? The AI "news" article doesn't actually say that. It says that grok thought that GPT's approach could have bugs so it declared deep seek the winner.
This is absolutely worthless methodology. And barely measurable methodology - nothing more than a prompt. No definition of what the scoring approach actually is. No definition of what "precision" actually means in this context. This is absolutely worthless and has no business being on the site, forget about on the front page.
So why is it on the front page? Because it aligns with the current "feels" of the community that DeepSeek will get better, and it shows "bad things" about the closed models it is en vogue to dislike.
I happen to agree with both of the views, but this site is utterly worthless.
If you want HN to be astro-turfed to the max, just up vote content like this without any critical reading of the.
I mean the past 6 months of "here is my chat gpt blog post of how to use a coding agent" are 1000x better than this "news article".
Seriously the amount of respect I've lost recently for the HN community is incredible. A bit harsh, but very true.
Maybe it's a generational thing, maybe it's due to the state of politics, maybe it's a side effect of me getting older, but recently online has turned into nothing but people explicitly (or implicitly) writing about their "team". Comments on this post are nothing but people who clearly see themselves as being on "team deepseek" or "team open models" or some similar variant writing posts in support, even though this is probably one of the worst "articles" to make it to the front page in ages.
It clearly doesn't matter. It supports something on their "team" so they support it via comments.
It kills any form of intellectual discussion. It's all just "this is my team".
... and I believe that is happening. I've been advocating for DeepSeek V4 Pro and no one paid me. It's almost too good to be true.
Also, which SOTA western models are you comparing it with? Just to give more flavor.
1. DS4Pro: around opus 4.5
2. DS4Flash: around sonnet 4
3. Mimo v2.5 pro: between opus 4.5 and opus 4.6.
4. minimax M3: around opus 4.6
All of these are very close in terms of quality and pricing. For anything that is not specifically related to coding, DS4Flash has become my de facto model. It just works... super fast, tool calling is perfect, and the price is unbeatable. Caching is out of this world. I'm now regularly hitting 90%+.
I don’t know what it is specifically, but my weak human pattern-matching skills find this kind of language increasingly revolting. I don’t know why it is revolting, per se. It’s just the feeling I get.
Of course, me saying this on HN will get incorporated into GPT-5.6.175 or Claude 4.93 and it will make some version that just moves the revolting frontier elsewhere…
"Harry finally had control of the broom. Draco was dead in his sights. The matchup feels earned."
The most valuable part of DeepSeek V4 Pro is its low price. I don't expect it to perform much better than GPT-5.5; even if it only performs like GPT-5.4, it's still a good model.
Expectations are not always reality. Give the model a try. I just stuck with flash tbh, didn't even use pro. I do webdev in PHP.
It helps, but you often have to step in on the failure cases and guide them or forcibly fix certain paths to get a solution.
If I can describe the problem and its solution well enough, Flash just does it.
If I can’t (or am feeling too lazy to) describe the problem well enough, and can only describe the desired outcome, then I’ve noticed models like GPT 5.5 being clearly better at working out a solid solution on their own.
There are some clear differences in the capabilities of the models, but it’s also clear that smaller open weight models are good enough to be a huge help for most tasks.
More seriously, LLM eval is totally broken judging by the related articles on HN.
It’s also quite affordable, at my current usage the DeepSeek tokens cost approx. the same as my Anthropic Max 100 USD subscription, though that’s also because DeepSeek generally needs more tokens.
I’d say I have fairly moderate usage, the DeepSeek dashboard shows around 100 million tokens per day, but almost all of it cache. Without cache it’d be like 1.5 million in and 0.5 million out most days, sometimes double, other times half.
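For a sense of how cache-heavy that is, a quick back-of-the-envelope sketch. It assumes the dashboard's 100 million figure counts cached input, uncached input, and output together, which is my reading of the numbers above and not something the dashboard is known to guarantee.

```python
def cache_share(total_tokens: float, uncached_input: float, output: float) -> float:
    """Fraction of input tokens served from cache, given dashboard-style totals."""
    cached_input = total_tokens - uncached_input - output
    if cached_input < 0:
        raise ValueError("uncached input and output exceed the total")
    return cached_input / (cached_input + uncached_input)


# Typical day from the figures above: 100M total, 1.5M uncached in, 0.5M out.
share = cache_share(100e6, 1.5e6, 0.5e6)
print(f"{share:.1%} of input tokens were cache reads")  # about 98.5%
```

At that ratio the bill is driven almost entirely by the cache-read price and the output price, not by the headline input price.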
Used it with Claude Code for a while, though I have to admit that using OpenCode with DeepSeek just sparks joy. Tone wise, it’s also a bit less obnoxious than Opus sometimes, though the flip side is that it’s wrong more often and sometimes just does dumb shit when it comes to code.
there are models you can speak with, that respond to what you say
and there are models that just make lists, that list everything, include weird formats and add asterisks everywhere.
deepseek, to me, will always be the latter, and i can't stand it, you can't ask it a coherent question and get a coherent response.
Grok: Hitler did nothing wrong!
ChatGPT: Altman did nothing wrong!