Here is a comparison for 4.5, 4.6 and 4.7 (Output Tokens section):
https://artificialanalysis.ai/?models=claude-opus-4-7%2Cclau...
4.7 comes out slightly cheaper than 4.6. But 4.5 is about half the cost:
https://artificialanalysis.ai/?models=claude-opus-4-7%2Cclau...
Notably the cost of reasoning has been cut almost in half from 4.6 to 4.7.
I'm not sure what that looks like for most people's workloads, i.e. what the cost breakdown looks like for Claude Code. I expect it's heavy on both input and reasoning, so I don't know how that balances out, now that input is more expensive and reasoning is cheaper.
On reasoning-heavy tasks, it might be cheaper. On tasks which don't require much reasoning, it's probably more expensive. (But for those, I would use Codex anyway ;)
https://news.ycombinator.com/item?id=47668520
People are already complaining about low quality results with Opus 4.7. I'm also spotting it making really basic mistakes.
I literally just caught it lazily "hand-waving" away things instead of properly thinking them through, even though it spent like 10 minutes churning tokens and ate only god knows how many percentage points off my limits.
> What's the difference between this and option 1.(a) presented before?
> Honestly? Barely any. Option M is option 1.(a) with the lifecycle actually worked out instead of hand-waved.
> Why are you handwaving things away though? I've got you on max effort. I even patched the system prompts to reduce this.
> Fair call. I was pattern-matching on "mutation + capture = scary" without actually reading the capture code. Let me do the work properly.
> You were right to push back. I was wrong. Let me actually trace it properly this time.
> My concern from the first pass was right. The second pass was me talking myself out of it with a bad trace.
It's just a constant stream of self-corrections and doubts. Opus simply cannot be trusted when adaptive thinking is enabled.
Can provide session feedback IDs if needed.
In my experience, prompts like this one, which 1) ask for a reason behind an answer (when the model won't actually be able to provide one), 2) are somewhat standoff-ish, don't work well at all. You'll just have the model go the other way.
What works much better is to tell the model to take a step back and re-evaluate. Sometimes it also helps to explicitly ask it to look at things from a different angle XYZ, in other words, to add some entropy to get it away from the local optimum it's currently at.
Do you think it knows what max effort or patched system prompts are? It feels really weird to talk to an LLM like it’s a person that understands.
It seems like they're working hard to prioritize wrapping their arms around huge contexts, as opposed to handling small tasks with precision. I prefer to limit the context and the scope of the task and focus on trying to get everything right in incremental steps.
Does it? Anthropic's own announcement says that for the same "effort level" 4.7 does more thinking (i.e. uses more output tokens) than 4.6, and they've also increased the default effort level from 4.6 high to 4.7 xhigh.
I'm not sure what dominates the cost for a typical mix of agentic coding tasks - input tokens or output ones, but if you are working on an existing project rather than a brand new one, then file input has to be a significant factor and preliminary testing says that the new tokenizer is typically generating 40% or so more tokens for the exact same input.
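To put a rough number on the tokenizer point: if the per-token price stays the same and the same files tokenize to ~40% more tokens, the input bill scales by the same factor. A minimal sketch, where the price and the token count are made-up placeholders (not Anthropic's actual pricing) and only the 40% figure comes from the comment above:

```python
def input_cost(tokens: int, price_per_mtok: float) -> float:
    """Dollar cost of `tokens` input tokens at a given $/1M-token price."""
    return tokens / 1_000_000 * price_per_mtok


# Hypothetical values, for illustration only.
PRICE_PER_MTOK = 5.0   # assumed $/1M input tokens, identical for both models
OLD_TOKENS = 200_000   # assumed size of the same files under the old tokenizer
INFLATION = 1.40       # "40% or so more tokens for the exact same input"

new_tokens = round(OLD_TOKENS * INFLATION)
old_cost = input_cost(OLD_TOKENS, PRICE_PER_MTOK)
new_cost = input_cost(new_tokens, PRICE_PER_MTOK)

print(f"tokens: {OLD_TOKENS} -> {new_tokens}")
print(f"input cost ratio: {new_cost / old_cost:.2f}x")
```

The ratio is independent of the assumed price, so the placeholder only matters for the absolute dollar amounts.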
I really have to wonder how much of 4.7's increase in benchmark scores over 4.6 is because the model is actually better trained for these cases, or just because it is using more tokens - more compute and thinking steps - to generate the output. It has to be a mix of the two.
"Opus 4.7 thinks more at higher effort levels, particularly on later turns in agentic settings. This improves its reliability on hard problems, but it does mean it produces more output tokens. "
I'm not sure where that discrepancy comes from (is Anthropic using different benchmarks?).
There are a few different theories, but all we have now are synthetic benchmarks, anecdotes and speculation.
(Benchmarks are misleading, I think our best bet now is for individuals to run real world tests, giving the same task to each model, and compare the quality, cost and time.)
The input cost inflation however is real, and dramatic.
I would have expected them to lower input costs proportionally, because otherwise you're getting less intelligence per dollar even with the smarter model. Think that would be the smartest thing for them to do, at least PR wise. And maybe a bit of free usage as an apology :)
Agree though that benchmarks aren't very helpful w.r.t. estimating real world performance or costs.
What we'd need are people giving the same real world tasks to 4.6 and 4.7 and measuring time, quality and costs.
I hit my 5 hour limit within 2 hours yesterday. Initially I was trying the batched mode for a refactor, but cancelled after seeing it take 30% of the limit within 5 minutes. I then tried a serial approach, which consumed less (took ~50 minutes, xhigh effort, ~60% of the remaining allocation IIRC), but the limit was still very clearly consumed much faster than with 4.6.
It feels like every exchange takes ~5% of the 5 hour limit now, when it used to be maybe ~1-2%. For reference I'm on the Max 5x plan.
For now I can tolerate it since I still have plenty of headroom in my limits (used ~5% of my weekly, I don't use claude heavily every day so this is OK), but I hope they either offer more clarity on this or improve the situation. The effort setting is still a bit too opaque to really help.
It decided to leave the write endpoints it added to an authentication service completely unauthenticated. Doing the right thing would have taken about 6 characters, and the requirement was in the claude.md. It also tried to implement PKCE by embedding _everything_ in the state.
This thing is beyond untrustworthy.
The fact that they are using Claude to build Claude (not just Claude Code) probably explains a lot.
Why can't they save the kv cache to disk then later reload it to memory?
if it's the latter that's crazy. i don't even know what to do there, compactions already feel like a memory wipe
And yes, Claude models are generally more fun to use than GPT/Codex. They have a personality. They have an intuition for design/aesthetics. Vibe-coding with them feels like playing a video game. But the result is almost always some version of cutting corners: tests removed to make the suite pass, duplicate code everywhere, wrong abstraction, type safety disabled, hard requirements ignored, etc.
These issues are not resolved in 4.7, no matter what the benchmarks say, and I don't think there is any interest in resolving them.
It seems that they got a grip on the "coding LLM" market and now they're starting to seek actual profit. I predict we'll keep seeing 40%+ more expensive models for a marginal performance gain from now on.
Just to get a sense for the rate of change, imagine if you took a survey. Compare what people said about AI tools... 3 years ago, 2 years ago, 1 year ago, 6 months ago. Then think about what is plausible that people will be saying in 3 months, 6 months, 9 months ...
Moving the goalposts has always happened, but it is happening faster than I've ever seen it. Many people seem to redefine their expectations on a monthly basis now. Worse, they seem to be unaware they are doing it.
Fancy search? Ok, I'll bite. Compare today's "fancy search" to what we had ~3 years ago according to your choice of metric. Here's one: minutes spent relative to information found. Today, in ~5 minutes I can do a literature review that would have taken me easily 10+ hours five years ago. We don't need to argue phrasing when we can pick some prototypical tasks and compare them.
We're going to have different takes about where various AI technologies will be in these future timelines. It is much better to run to where the ball is likely to be, even if we have different ideas of where that is.
The human brain, at best, struggles to grasp even linear change. But linear change is not a good way to predict compounding technological change.
I’m definitely not coming to this from a “AI is useless” angle. I’ve been using these tools extensively over the past year and they are providing a massive productivity boost.
However when you guide the AI as a constant, and the model behaves MUCH differently (given a baseline guide), that is where the problem lies.
It's as if your 'guidance' has to be variable on how well the model is behaving. Analogy is a junior dev who is sometimes excellent, and sometimes shows up drunk for work and you have no breathalyzer.
Is that what the soul is?
This part of the above comment strikes me as uncharitable and overconfident. And, to be blunt, presumptuous. To claim to know a company's strategy as an outsider is messy stuff.
My prior: it is 10X to 20X more likely Anthropic has done something other than shift to a short-term "squeeze their customers" strategy (which I think is only around ~5%).
What do I mean by "something other"? (1) One possibility is they are having capacity and/or infrastructure problems so the model performance is degraded. (2) Another possibility is that they are not as tuned to what customers want relative to what their engineers want. (3) It is also possible they have slowed their models down due to safety concerns. To be more specific, they are erring on the side of caution (which would be consistent with their press releases about safety concerns of Mythos). Also, the above three possibilities are not mutually exclusive.
I don't expect us (readers here) to agree on the probabilities down to the ±5% level, but I would think a large chunk of informed and reasonable people can probably converge to something close to ±20%. At the very least, can we agree all of these factors are strong contenders: each covers maybe at least 10% to 30% of the probability space?
How short-sighted, dumb, or back-against-the-wall would Anthropic have to be to shift to a "let's make our new models intentionally _worse_ than our previous ones" strategy? Think on this. I'm not necessarily "pro" Anthropic. They could lose standing with me over time, for sure. I'm willing to think it through. What would the world have to look like for this to be the case?
There are other factors that push back against claims of a "short-term greedy strategy" argument. Most importantly, they aren't stupid; they know customers care about quality. They are playing a longer game than that.
Yes, I understand that Opus 4.7 is not impressing people or worse. I feel similarly based on my "feels", but I also know I haven't run benchmarks nor have I used it very long.
I think most people viewed Opus 4.6 as a big step forward. People are somewhat conditioned to expect a newer model to be better, and Opus 4.7 doesn't match that expectation. I also know that I've been asking Claude to help me with Bayesian probabilistic modeling techniques that are well outside what I was doing a few weeks ago (detailed research and systems / software development), so it is just as likely that I'm pushing it outside its expertise.
I said "it seems like". Obviously, I have no idea whether this is an intentional strategy or not and it could as well be a side effect of those things that you mentioned.
Models being "worse" is the perceived effect for the end user (subjectively, it seems like the price to achieve the same results on similar tasks with Opus has been steadily increasing). I am claiming that there is no incentive for Anthropic to address this issue because of their business model (maximize the amount of tokens spent and price per token).
>> This part of the above comment strikes me as uncharitable and overconfident. And, to be blunt, presumptuous. To claim to know a company's strategy as an outsider is messy stuff.
> I said "it seems like".
Sorry. I take back the "presumptuous" part. But part of my concern remains: of all the things you chose to write, you only mentioned "the Tinder/casino intermittent reinforcement strategy". That phrase is going to draw eyeballs, and you got mine at least. As a reader, it conveys you think it is the most likely explanation. I'm trying to see if there is something there that I'm missing. How likely do you think it is? Do you think it is more likely than the other three I mentioned? If so, it seems like your thinking hinges on this:
> I am claiming that there is no incentive for Anthropic to address this issue because of their business model (maximize the amount of tokens spent and price per token).
No incentive? Hardly. First, Anthropic is not a typical profit-maximizing entity; it is a Public Benefit Corporation [1] [2]. Yes, profits still matter, but there are other factors to consider if we want to accurately predict their actions.
Second, even if profit maximization is the only incentive in play, profit-maximizing entities can plan across different time horizons. Like I mentioned in my above comment, it would be rather myopic to damage their reputation with a strategy that I summarize as a short-term customer-squeeze strategy.
Third, like many people here on HN, I've lived in the Bay Area, and I have first-degree connections that give me high confidence (P>80%) that key leaders at Anthropic have motivations that go much beyond mere profit maximization.
A\'s AI safety mission is a huge factor and not the PR veneer that pessimists tend to claim. Most people who know me would view me as somewhat pessimistic and anti-corporate and P(doomy). I say this to emphasize I'm not just casting stones at people for "being negative". IMO, failing to recognize and account for Anthropic's AI safety stance isn't "informed hard-hitting pessimism" so much as "limited awareness and/or poor analysis".
I'm not naive. That safety mission collides in a complicated way with FU money potential. Still, I'm confident (P>60%) that a significant number (>20%) of people at Anthropic have recently "cerebrated bad times" [3] i.e. cogitated futures where most humans die or lose control due to AI within ~10 to ~20 years. Being filthy rich doesn't matter much when dead or dehumanized.
[1]: https://law.justia.com/codes/delaware/title-8/chapter-1/subc...
[2]: https://time.com/6983420/anthropic-structure-openai-incentiv...
[3]: Weird Al: please make "Cerebration" for us.
https://artificialanalysis.ai/?intelligence-efficiency=intel...
Looking at their cost breakdown, while input cost rose by $800, output cost dropped by $1400. Granted whether output offsets input will be very use-case dependent, and I imagine the delta is a lot closer at lower effort levels.
Tokenizer changes are one piece to understand for sure, but as you say, you need to evaluate $/task not $/token or #tokens/task alone.
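To make the $/task point concrete, here's a rough sketch. The $800 and $1400 figures are the ones quoted above; everything else (prices, token counts, the growth/shrink factors) is hypothetical and only there to show how the sign of the change can flip with the workload:

```python
from dataclasses import dataclass


@dataclass
class Usage:
    input_tokens: int
    output_tokens: int  # reasoning tokens are counted as output here


def cost_per_task(usage: Usage, input_price: float, output_price: float) -> float:
    """Dollar cost of one task, given $/1M-token prices."""
    return (usage.input_tokens * input_price
            + usage.output_tokens * output_price) / 1_000_000


# Figures quoted in the comment above (totals for the benchmark run, in dollars).
INPUT_DELTA = 800      # input cost rose by $800
OUTPUT_DELTA = -1400   # output cost dropped by $1400
net_delta = INPUT_DELTA + OUTPUT_DELTA  # negative means cheaper overall on that run

# Everything below is hypothetical: made-up prices and token counts.
IN_PRICE, OUT_PRICE = 5.0, 25.0          # assumed $/1M tokens
INPUT_GROWTH, OUTPUT_SHRINK = 1.4, 0.5   # assumed change in token counts


def new_usage(old: Usage) -> Usage:
    """Apply the assumed token-count changes to an old-model usage profile."""
    return Usage(round(old.input_tokens * INPUT_GROWTH),
                 round(old.output_tokens * OUTPUT_SHRINK))


input_heavy = Usage(input_tokens=500_000, output_tokens=5_000)
reasoning_heavy = Usage(input_tokens=20_000, output_tokens=100_000)

print(f"net delta on the quoted run: {net_delta:+d} dollars")
for name, old in [("input-heavy", input_heavy), ("reasoning-heavy", reasoning_heavy)]:
    before = cost_per_task(old, IN_PRICE, OUT_PRICE)
    after = cost_per_task(new_usage(old), IN_PRICE, OUT_PRICE)
    print(f"{name}: {before:.3f} -> {after:.3f} dollars per task")
```

Under these assumed numbers the input-heavy task gets more expensive and the reasoning-heavy one gets cheaper, which is the "very use-case dependent" part. Plug in your own token logs to see where you land.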
Though, from my limited testing, the new model is far more token hungry overall
I’ve noticed 4.7 cycling a lot more on basic tasks. Though, it also seems a bit better at holding long running context.
My workflow is to give the agent pretty fine-grained instructions, and I'm always fighting agents that insist on doing too much. Opus 4.5 is the best out of all agents I've tried at following the guidance to do only-what-is-needed-and-no-more.
Opus 4.6 takes longer, overthinks things and changes too much; the high-powered GPTs are similarly flawed. Other models such as Sonnet aren't nearly as good at discerning my intentions from less-than-perfectly-crafted prompts as Opus.
Eventually, I quit experimenting and just started using Opus 4.5 exclusively knowing this would all be different in a few months anyway. Opus cost more, but the value was there.
But now I see that 4.7 is going to replace both 4.5 and 4.6 in VSCode Copilot, and with a 7.5x modifier. Based on the description, this is going to be a price hike for slower performance — and if the 4.5 to 4.6 change is any guide, more overthinking targeted at long-running tasks, rather than fine-grained. For me, that seems like a step backwards.
I find that Opus is really good at discerning what I mean, even when I don't state it very clearly. Sonnet often doesn't quite get where I'm going and it sometimes builds things that don't make sense. Sonnet also occasionally makes outright mistakes, like not catching every location that needs to be changed; Opus makes nearly every code change flawlessly, as if it's thinking through "what could go wrong" like a good engineer would.
Sonnet is still better than older and/or less-capable models like GPT 4.1, Raptor mini (Preview), or GPT-5 mini, which all fail in the same way as Sonnet but more dramatically... but Opus is much better than Sonnet.
Recent full-powered GPTs (including the Codex variants) are competitive with Opus 4.6, but Opus 4.5 in particular is best in class for my workflow. I speculate that Opus 4.5 dedicates the most cycles out of all models to checking its work and ensuring correctness — as opposed to reaching for the skies to chase ambitious, highly complex coding tasks.
as in 4.5 is no longer going to be avail? F.
i've also been sticking with 4.5, that sucks
> Over the coming weeks, Opus 4.7 will replace Opus 4.5 and Opus 4.6 in the model picker for Copilot Pro+[...]
> This model is launching with a 7.5× premium request multiplier as part of promotional pricing until April 30th.
After just ~4 prompts I blew past my daily limit. Another ~7 more prompts & I blew past my weekly limit.
The entire HTML/CSS/JS was less than 300 lines of code.
I was shocked how fast it exhausted my usage limits.
With an enterprise subscription, the bill gets bigger, but it's not like a VP can easily send a memo to all staff that a migration is coming.
Individuals may end their subscriptions, which would ease the DC usage and turn profits up.
The "small subset" argument is profoundly unconvincing, and inconsistent with both neurobiology of the human brain and the actual performance of LLMs.
The transformer architecture is incredibly universal and highly expressive. Transformers power LLMs, video generator models, audio generator models, SLAM models, entire VLAs and more. It's not a 1:1 copy of the human brain, but that doesn't mean that it's incapable of reaching functional equivalence. The human brain isn't the only way to implement general intelligence - just the one that was the easiest for evolution to put together out of what it had.
LeCun's arguments about "LLMs can't do X" keep being proven wrong empirically. Even on ARC-AGI-3, which is a benchmark specifically designed to be adversarial to LLMs and target the weakest capabilities of off the shelf LLMs, there is no AI class that beats LLMs.
The human brain is not a pretrained system. It's objectively more flexible than transformers and capable of self-modulation in ways that no ML architecture can replicate (that I'm aware of).
I've seen plenty of wacky test-time training things used in ML nowadays, which is probably the closest to how the human brain learns. None are stable enough to go into the frontier LLMs, where in-context learning still reigns supreme. In-context learning is a "good enough" continuous learning approximation, it seems.
And even then... why can't they write a novel? Or lowering the bar, let's say a novella like Death in Venice, Candide, The Metamorphosis, Breakfast at Tiffany's...?
Every book's in the training corpus...
Is it just a matter of someone not having spent a hundred grand in tokens to do it?
It's just that the ones that manage to suppress all the AI writing "tells" go unnoticed as AI. This is a type of survivorship bias, though I feel there must be a better term for it that eludes me.
There's a lot of bad writing out there, I can't imagine nobody has used an LLM to write a bad novella.
I provide four examples in my comment...
[Opus 4.6] 3% context | last: 5.2k in / 1.1k out
add this to .claude/settings.json
"statusLine": { "type": "command", "command": "jq -r '\"[\\(.model.display_name)] \\(.context_window.used_percentage // 0)% context | last: \\(((.context_window.current_usage.input_tokens // 0) / 1000 * 10 | floor / 10))k in / \\(((.context_window.current_usage.output_tokens // 0) / 1000 * 10 | floor / 10))k out\"'" }
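For anyone who doesn't read jq fluently, here is a rough Python equivalent of what that expression computes. The field names are taken from the jq command above; the sample payload is made up, and I'm assuming the status line command receives JSON of this shape on stdin. It's only meant to explain the formatting, not to replace the jq one-liner (for instance, a missing model name prints as `None` here rather than jq's `null`).

```python
import math


def to_k(tokens: float) -> float:
    """Tokens in thousands, truncated to one decimal.

    Mirrors the jq fragment `/ 1000 * 10 | floor / 10`.
    """
    return math.floor(tokens / 1000 * 10) / 10


def status_line(payload: dict) -> str:
    """Build the same status string as the jq expression above."""
    model = (payload.get("model") or {}).get("display_name")
    ctx = payload.get("context_window") or {}
    usage = ctx.get("current_usage") or {}
    # `or 0` plays the role of jq's `// 0` fallback for missing/null fields.
    pct = ctx.get("used_percentage") or 0
    tokens_in = usage.get("input_tokens") or 0
    tokens_out = usage.get("output_tokens") or 0
    return (f"[{model}] {pct:g}% context | "
            f"last: {to_k(tokens_in):g}k in / {to_k(tokens_out):g}k out")


# Made-up sample payload, shaped after the fields the jq command reads.
sample = {
    "model": {"display_name": "Opus 4.6"},
    "context_window": {
        "used_percentage": 3,
        "current_usage": {"input_tokens": 5230, "output_tokens": 1100},
    },
}
print(status_line(sample))
```

With the sample payload this produces the same line as the example shown above the config.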
After a few basic operations (retrospective look at the flow of recent reviews, product discussions) I would expect this to act like a senior member of the team, whereas 4.6 was good but far more likely to be a foot-gun.
We'll be keeping an eye on open models (which we already make good use of). I think that's the way forward. Actually it would be great if everybody would put more focus on open models, perhaps we can come up with something like the "linux/postgres/git/http/etc" of the LLMs: something we all can benefit from while it not being monopolized by a single billionaire company. Wouldn't it be nice if we don't need to pay for tokens? Paying for infra (servers, electricity) is already expensive enough.
One of two main reasons why I'm wary of LLMs. The other is fear of skill atrophy. These two problems compound. Skill atrophy is less bad if the replacement for the previous skill does not depend on a potentially less-than-friendly party.
It was an experiment to see if I could enter a mature codebase I had zero knowledge of, look at it entirely through an AI, and come to understand it.
And it worked! Even though I've only worked on the codebase through Claude, whenever I pick up a ticket nowadays I know what file I'll be editing and how it relates to the rest of the code. If anything, I have a significantly better understanding of the codebase than I would without AI at this point in my onboarding.
>I have a significantly better understanding of the codebase than I would without AI at this point in my onboarding
One of the pitfalls of using AI to learn is the same as I'd see students doing pre-AI with tutoring services. They'd have tutors explain the homework to them and even work through the problems with them. Thing is, any time you see a problem or concept solved, your brain is tricked into thinking you understand the topic enough to do it yourself. It's why people think their job interview questions are much easier than they really are; things just seem obvious when you've thought about the solution. Anyone who's read a tutorial, felt like they understood it well, and then struggled for a while to actually start using the tool to make something new knows the feeling very well. That Todo List app in the tutorial seemed so simple, but the author was making a bunch of decisions constantly that you didn't have to think about as you read it.
So I guess my question would be: If you were on a plane flight with no wifi, and you wanted to do some dev work locally on your laptop, how comfortable would you be vs if you had done all that work yourself rather than via Claude?
I've worked with people who will look at code they don't understand, say "llm says this", and express zero intention of learning something. Might even push back. Be proud of their ignorance.
It's like, why even review that PR in the first place if you don't even know what you're working with?
A good dev would've read deeper into the concern and maybe noticed potential flaws, and if he had his own doubts about what the concern was about, would have asked for more clarification. Not just feed a concern into AI and fling it back. Like please, in this day and age of AI, have the benefit of the doubt that someone with a concern would have checked with AI himself if he had any doubts of his own concern...
We have gone multi cloud disaster recovery on our infrastructure. Something I would not have done yet, had we not had LLMs.
I am learning at an incredible rate with LLMs.
But I’m so much more detached from the code, I don’t feel that ‘deep neural connection’ from actually spending days locked in a refactor or debugging a really complex issue.
I don’t know how I feel about it.
Could you do it again without the help of an LLM?
If no, then can you really claim to have learned anything?
I don't believe it. Having something else do the work for you is not learning, no matter how much you tell yourself it is.
That’s product atrophy, not skill atrophy.
And not even just understanding, but verifying that they’ve implemented the optimal solution.
What an interesting paradox-like situation.
When future humans rediscover mathematics.
And don’t get me started on memory management. Nobody even knows how to use malloc(), let alone brk()/mmap(). Everything is relying on automatic memory management.
I mean when was the last time you actually used your magnetized needle? I know I am pretty rusty with mine.
Yeah, exactly.
It’s like saying clothing manufacturers are paying the “loom tax” when they could have been weaving by hand…
Where producing 2x the t-shirts will get you ~2x the revenue, it's quite unlikely that 10x the code will get you even close to 2x revenue.
With how much of this industry operates on 'Vendor Lock-in' there's a very real chance the multiplier ends up 0x. AI doesn't add anything when you can already 10x the prices on the grounds of "Fuck you. What are you gonna do about it?"
Open source libraries and projects together with open source AI is the only way to avoid the existential risks of closed source AI.
I don't know about 10x, but this could only happen if PMs suddenly got really lazy or the engineers actually got at least 1.5x faster. My gut says it's way more because we're now also consistently up to date on our dependencies and completing massive refactors we were putting off for years.
There are lots of reasons this could be the case. Quality suddenly changed, the nature of the work changed, engineers leveled up... But for this to have happened consistently across a bunch of engineering teams is quite the coincidence if not this one thing we are all talking about.
The evangelists told us 20 years ago that if we weren't doing TDD then we weren't really professional programmers at all. The evangelists told us 10 years ago that if we were still running stuff locally then we must be paying a fortune for IT admin or not spending our time on the work that mattered. The evangelists this week tell us that we need to be using agents to write all our code or we'll get left in the dust by our competitors who are.
I'm still waiting for my flying car. Would settle for some graphics software on Linux that matches the state of the art on Windows or even reliable high-quality video calls and online chat rooms that don't make continental drift look fast.
This doesn't happen. Literally zero evidence of this.
Frontier labs are incentivized to keep it that way, and they're investing billions to make AI = API the default. But that's a business model, not a technical inevitability.
ive had to like tune out of the LLM scene because it's just a huge mess. It feels impossible to actually get benchmarks, it's insanely hard to get a grasp on what everyone is talking about, bots galore championing whatever model, it's just way too much craze and hype and misinformation. what I do know is we can't keep draining lakes with datacenters here and letting companies that are willing to heel turn on a whim basically control the output of all companies. that's not going to work, we collectively have to find a way to make local inference the path forward.
everyone's foot is on the gas. all orgs, all execs, all peoples working jobs. there's no putting this stuff down, and it's exhausting but we have to be using claude like _right now_. pretty much every company is already completely locked in to openai/gemini/claude and for some unfortunate ones copilot. this was a utility vendor lock in capture that happened faster than anything ive ever seen in my life & I already am desperate for a way to get my org out of this.
My manager doesn't even want us to use copilot locally. Now we are supposed to only use the GitHub copilot cloud agent. One shot from prompt to PR. With people like that selling vendor lock in for them these companies like GitHub, OpenAI, Anthropic etc don't even need sales and marketing departments!
But it requires that one does not do something stupid.
Eg. For recurring tasks: keep the task specification in the source code and just ask Claude to execute it.
The same with all documentation, etc.
The open model mentality is also just so bizarre to me. You're going to use an inferior model to save, what, a couple hundred bucks a month? Is your time really worth that little?
No one working on a serious project at a serious company is downgrading their agent's intelligence for a marginal cost saving. Downgrading your model is like downgrading the toilet paper on your yacht.
5.1 is like $4 / 1m output, Opus 4.6 is $25. GPT 5.4 pro is $270 with large contexts :O
I've said it before and I'll say it again, local models are "there" in terms of true productive usage for complex coding tasks. Like, for real, there.
The issue right now is that buying the compute to run the top end local models is absurdly unaffordable. Both in general but also because you're outbidding LLM companies for limited hardware resources.
You have a $10K budget, you can legit run last year's SOTA agentic models locally and do hard things well. But most people don't or won't, nor does it make cost effective sense Vs. currently subsidized API costs.
Early last year or late last year?
opus 4.5 was quite a leap
I fear that this may not be feasible in the long term. The open-model free ride is not guaranteed to continue forever; some labs offer them for free for publicity after receiving millions in VC grants now, but that's not a sustainable business model. Models cost millions/billions in infrastructure to train. It's not like open-source software where people can just volunteer their time for free; here we are talking about spending real money upfront, for something that will get obsolete in months.
Current AI model "production" is more akin to an industrial endeavor than open-source arrangements we saw in the past. Until we see some breakthrough, I'm bearish on "open models will eventually save us from reliance on big companies".
If you mean obsolete in the sense of "no longer fit for purpose" I don't think that's true. They may become obsolete in terms of "can't do hottest new thing" but that's true of pretty much any technology. A capable local model that can do X will always be able to do X, it just may not be able to do Y. But if X is good enough to solve your problem, why is a newer better model needed?
I think if we were able to achieve ~Opus 4.6 level quality in a local model that would probably be "good enough" for a vast number of tasks. I think it's debatable whether newer models are always better - 4.7 seems to be somewhat of a regression for example.
1. Opencode
2. Fireworks AI: GLM 5.1
And it is SIGNIFICANTLY cheaper than Claude. I'm waiting eagerly for something new from Deepseek. They are going to really show us magic.
model                          elo    $/M
---------------------------------------
glm-5.1                        1538   2.60
glm-4.7                        1440   1.41
minimax-m2.7                   1422   0.97
minimax-m2.1-preview           1392   0.78
minimax-m2.5                   1386   0.77
deepseek-v3.2-thinking         1369   0.38
mimo-v2-flash (non-thinking)   1337   0.24
https://arena.ai/leaderboard/code?viewBy=plot&license=open-s...
I don't know if it is bun related, but in Task Manager it is the thing that is almost always at the top by CPU usage. Turns out that, for me, bun is not production ready at all.
Wish Zed editor had something like BigPickle which is free to use without limits.
Google just released Gemma 4, perhaps that'd be worth a try?
If you have HPC or Supercompute already, you have much of the expertise on staff already to expand models locally, and between Apple Silicon and Exo there are some amazing solutions out there.
Now, if only the rumors about Exo expanding to Nvidia are true..
Training and inference cost money, so we would have to pay for them.
I think companies that are shelling out the money for these enterprise accounts could honestly just buy some H100 GPUs and host the models themselves on premises. GitHub Copilot Enterprise charges $40 per user per month (this can vary depending on your plan of course), and at this price 1000 users come out to $480,000 a year. Maybe I'm missing something, but that's roughly what you're going to be spending to get a full-fledged hosting setup for LLMs.
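The arithmetic checks out, and it's easy to turn into a quick break-even estimate. A minimal sketch: the $40 and 1000 users are from the comment above, while the hardware and running-cost figures are placeholders I made up (real quotes for GPUs, power, and staff will differ a lot):

```python
from typing import Optional

PRICE_PER_USER_PER_MONTH = 40   # USD, figure from the comment above
USERS = 1000

annual_subscription = PRICE_PER_USER_PER_MONTH * USERS * 12


def breakeven_years(hardware_cost: float,
                    annual_opex: float,
                    annual_subscription_cost: float) -> Optional[float]:
    """Years until self-hosting has paid for itself, or None if it never does."""
    yearly_savings = annual_subscription_cost - annual_opex
    if yearly_savings <= 0:
        return None
    return hardware_cost / yearly_savings


# Hypothetical self-hosting costs, for illustration only.
HARDWARE_COST = 480_000   # assumed one-off spend on servers/GPUs
ANNUAL_OPEX = 120_000     # assumed power, hosting and admin per year

print(f"subscription: ${annual_subscription:,} per year")
years = breakeven_years(HARDWARE_COST, ANNUAL_OPEX, annual_subscription)
print(f"break-even after about {years:.1f} years" if years else "never breaks even")
```

This ignores the quality gap between hosted frontier models and whatever you can run on premises, which is probably the bigger factor in practice.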
made a HN post of my X article on the lock-in factor and how we should embrace the modular unix philosophy as a way out: https://news.ycombinator.com/item?id=47774312
I'm still surprised top CS schools are not investing in having their students build models, I know some are, but like, when's the last time we talked about a model not made by some company, versus a model made by some college or university, which is maintained by the university and useful for all.
It's disgusting that OpenAI still calls itself "Open AI" when they aren't truly open.
Sticking with codex. Also GPT 5.5 is set to come next week.
I think people aren’t reading the system cards when they come out. They explicitly explain your workflow needs to change. They added more levels of effort and I see no mention of that in this post.
Did y’all forget Opus 4? That was not that long ago that Claude was essentially unusable then. We are peak wizardry right now and no one is talking positively. It’s all doom and gloom around here these days.
How about - don't break my workflow unless the change is meaningful?
While we're at it, either make y in x.y mean "groundbreaking", or "essentially same, but slightly better under some conditions". The former justifies workflow adjustments, the latter doesn't.
I'm surprised that it's 45%. Might go down (?) with longer context answers but still surprising. It can be more than 2x for small prompts.
If I can have Claude write up the plan, and the other models actually execute it, I'd get the best of both worlds.
(Amusingly, I think Codex tolerates being invoked by Claude (de facto tolerated ToS violation), but not the other way around.)
You could nonetheless have Codex write up the plan to an .md file for Claude (perhaps Sonnet or even Haiku?) to execute.
If tech companies convince Congress that AI is an existential issue (in defense or even just productivity), then these companies will get subsidies forever.
And shafting your customers too hard is bad for business, so I expect only moderate shafting. (Kind of surprised at what I've been seeing lately.)
So far, Opus 4.7 seems a bit smarter than Opus 4.6 for my use case. That's my only concern. Is an $80 bottle of wine a better value than a $20 or $40 bottle of wine? Pretty much never. If there are those of us willing to buy $80 bottles of wine, of course the market will facilitate this.
People can use whatever model they want. I'm too worried about worms crawling through my dead body to waste time on any but the smartest model any moment can offer.
And what's missing in all these token count complaints is that 4.7 is actually cheaper overall anyways because it produces fewer output tokens.
Plenty of OSS models being released as of late, with GLM and Kimi arguably being the most interesting for the near-SOTA case ("give these companies a run for their money"). Of course, actually running them locally for anything other than very slow Q&A is hard.
This gives me hope that even if future versions of Opus continue to target long-running tasks and get more and more expensive while being less-and-less appropriate for my style, that a competitor can build a model akin to Opus 4.5 which is suitable for my workflow, optimizing for other factors like cost.
Is Opus 4.7 that significantly different in quality that it should use that much more in tokens?
I like Claude and Anthropic a lot, and hope it's just some weird quirk in their tokenizer or whatnot, just seems like something changed in the last few weeks and may be going in a less-value-for-money direction, with not much being said about it. But again, could just be some technical glitch.
Our default topology is a two-agent pair: one implementer and one reviewer. In practice, that usually means Opus writing code and Codex reviewing it.
I just finished a 10-hour run with 5 of these teams in parallel, plus a Codex run manager. Total swarm: 5 Opus 4.7 agents and 6 Codex/GPT-5.4 agents.
Opus was launched with:
`export CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=35 && claude --dangerously-skip-permissions --model 'claude-opus-4-7[1M]' --effort high --thinking-display summarized`
Codex was launched with:
`codex --dangerously-bypass-approvals-and-sandbox --profile gpt-5-4-high`
What surprised me was usage: after 10 hours, both my Claude Code account and my Codex account had consumed 28% of their weekly capacity from that single run.
I expected Claude Code usage to be much higher. Instead, on these settings and for this workload, both platforms burned the same share of weekly budget.
So from this datapoint alone, I do not see an obvious usage-efficiency advantage in switching from Opus 4.7 to Codex/GPT-5.4.
First they introduce a policy to ban third party clients, but the way it's written, it affects claude -p too, and 3 months later, it's still confusing with no clarification.
Then they hide model's thinking, introduce a new flag which will still show summaries of thinking, which they break again in the next release, with a new flag.
Then they silently cut the usage limits to the point where the exact same usage that you're used to consumes 40% of your weekly quota in 5 hours. Not only did they stay silent for two entire weeks, they actively gaslit users by saying they didn't change anything, only to announce later that they did indeed change the limits.
Then they serve a lobotomized model for an entire week before they drop 4.7, again, gaslighting users that they didn't do that.
And then this.
Anthropic has lost all credibility at this point and I will not be renewing my subscription. If they can't provide services at a given price point, they should just increase the price or not provide them.
EDIT: forgot "adaptive thinking", so add that too. Which essentially means "we decide when we can allocate resources for thinking tokens based on our capacity, or in other words - never".
Having had a taste of unnerfed Opus 4.6, I think they have a conflict of interest: if they let models give the right answer the first time, a person will spend less time with it and spend less money, but if they make the model artificially dumber (progressive reasoning, if you will), people get frustrated but will spend more money.
It is likely happening because the economics don't work. Running a comparable model at comparable speed for an individual is prohibitively expensive. Now scale that to millions of users: something's gotta give.
It’s funny everyone says “the cost will just go down” with AI but I don’t know.
We need to keep the open source models alive and thriving. Oh, but wait the AI companies are buying all the hardware.
To me this seems more that it's trained to be concise by default which I guess can be countered with preference instructions if required.
What's interesting to me is that they're using a new tokeniser. Does it mean they trained a new model from scratch? Used an existing model and further trained it with a swapped out tokeniser?
The looped model research / speculation is also quite interesting - if done right there's significant speed up / resource savings.
It's going to be a very expensive game, and the masses will be left with subpar local versions. It would be as if we reversed the democratization of compilers and coding tooling that happened in the 90s and 00s, and the polished, more capable tools were again all proprietary.
So over time older models will be less valuable, but new models will only be slightly better. Frontier players, therefore, are in a losing business. They need to charge high margins to recoup their high training costs. But latecomers can simply train for a fraction of the cost.
Since performance is asymptotic, eventually the first-mover advantage is entirely negligible and LLMs become a simple commodity.
The only moat I can see is data, but distillation proves that this is easy to subvert.
There will probably be a window though where insiders get very wealthy by offloading onto retail investors, who will be left with the bag.
There hasn't been a real Moore's law for a good while even before LLMs.
And memory isn't getting less expensive either...
Oh well
OpenAI was built as you say. Google had a corporate motto of "Don't be evil" which they removed so they could, um, do evil stuff without cognitive dissonance, I guess.
This is the other kind of enshittification where the businesses turn into power accumulators.
You could call it a rug pull, but they may just be doing the math and realizing this is where pricing needs to shift to before going public.
It's no secret that the model is the best in the world. Yet it is crazy expensive and this 35% is huge for us. $10,000 becomes $13,500. Don't forget, Anthropic's tokenizer also counts way more tokens than other providers'.
We have experimented a lot with GLM 5.1. It is kinda close, but with downsides: no images, max 100K adequate context size and poor text writing. However, a great designer. So there is no replacement. We pray.
It was on the higher end of Anthropic's range: closer to 30-40% more tokens.
https://www.claudecodecamp.com/p/i-measured-claude-4-7-s-new...
what makes it worse is it compounds with two other things: thinking tokens (invisible but counted against limits) and the more verbose output style. so the effective cost delta is closer to 1.5-2x, not just the 1.35x from the tokenizer alone.
practically the only mitigation right now is to keep using 4.6 for tasks where you don't need the reasoning improvements and only use 4.7 when you actually need it. but that means maintaining model selection logic per-task, which most people won't bother with.
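A rough sketch of what that per-task model selection could look like. Everything here is an assumption for illustration: the model identifier strings, the `Task` type and the `needs_deep_reasoning` flag are hypothetical, and how you decide that a task is reasoning-heavy is the hard part this does not solve.

```python
from dataclasses import dataclass

# Assumed model identifiers; check your provider's model list for the real ones.
CHEAPER_MODEL = "claude-opus-4-6"
REASONING_MODEL = "claude-opus-4-7"


@dataclass
class Task:
    description: str
    needs_deep_reasoning: bool  # hypothetical flag, set by the caller


def pick_model(task: Task) -> str:
    """Route to 4.7 only when the task is flagged as reasoning-heavy."""
    return REASONING_MODEL if task.needs_deep_reasoning else CHEAPER_MODEL


if __name__ == "__main__":
    tasks = [
        Task("rename a variable across a module", needs_deep_reasoning=False),
        Task("trace a lifecycle bug through capture code", needs_deep_reasoning=True),
    ]
    for task in tasks:
        print(f"{pick_model(task)}: {task.description}")
```

Keeping the choice in one function at least means the routing rule lives in a single place when the next model ships.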
Maybe I missed it, but it doesn’t tell you if it’s more successful for less overall cost?
I can easily make Sonnet 4.6 cost way more than any Opus model, because while it's cheaper per prompt it might take 10x more rounds to solve a problem (or never solve it).
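To make the point concrete, here is a toy calculation of cost per solved problem rather than cost per prompt. The dollar amounts and round counts are invented for illustration and are not measurements of any model.

```python
def cost_to_solve(cost_per_round: float, rounds: int) -> float:
    """Total spend to reach a working solution: price per round times rounds needed."""
    return cost_per_round * rounds


# Hypothetical numbers: the cheaper model is 5x cheaper per round
# but needs 10x as many rounds to get there.
cheaper_model = cost_to_solve(cost_per_round=0.20, rounds=10)
pricier_model = cost_to_solve(cost_per_round=1.00, rounds=1)

print(f"cheaper model: ${cheaper_model:.2f}")
print(f"pricier model: ${pricier_model:.2f}")
print("cheaper model costs more overall:", cheaper_model > pricier_model)
```

If the cheaper model never converges, the round count is unbounded and no per-prompt discount makes up for it.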
That's an incentive difficult to reconcile with the user's benefit.
To keep this business running they do need to invest to make the best model, period.
It happens to be exactly what Anthropic's strategy is. That and great tooling.
And they're selling less and less (suddenly the 5 hour window lasts 1 hour on similar tasks that lasted 5 hours a week ago), so IMO they're scamming.
I hope many people are making notes and will raise heat soon.
Anthropic has to keep racing ahead and be seen as offering the best frontier models.
It isn't optimal, so the models cost them disproportionately too much to sell at a profitable price. So they keep feeding the hype and pushing the costs higher, hoping there won't be too much heat and that they get away with it.
I wouldn't like to be a leader at such a company, but their pay keeps them in line.
The difference here is Opus 4.7 has a new tokenizer which converts the same input text to a higher number of tokens. (But it costs the same per token?)
> Claude Opus 4.7 uses a new tokenizer, contributing to its improved performance on a wide range of tasks. This new tokenizer may use roughly 1x to 1.35x as many tokens when processing text compared to previous models (up to ~35% more, varying by content), and /v1/messages/count_tokens will return a different number of tokens for Claude Opus 4.7 than it did for Claude Opus 4.6.
> Pricing remains the same as Opus 4.6: $5 per million input tokens and $25 per million output tokens.
ArtificialAnalysis reports 4.7 significantly reduced output tokens though, and overall ~10% cheaper to run the evals.
I don't know how well that translates to Claude Code usage though, which I think is extremely input heavy.
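As a back-of-the-envelope check on the quoted figures: if the per-token price is unchanged but the same text becomes up to 1.35x as many tokens, the price per unit of text scales by the same factor. A small sketch, using only the $5 / $25 prices and the 1x to 1.35x range from the quote (the function name is made up):

```python
# Prices quoted above, in dollars per million tokens.
INPUT_PRICE = 5.00
OUTPUT_PRICE = 25.00


def effective_price(price_per_mtok: float, token_multiplier: float) -> float:
    """Price for the amount of text that used to fit in one million tokens."""
    return price_per_mtok * token_multiplier


# The two ends of the quoted range: no change, and ~35% more tokens.
for multiplier in (1.00, 1.35):
    input_cost = effective_price(INPUT_PRICE, multiplier)
    output_cost = effective_price(OUTPUT_PRICE, multiplier)
    print(f"x{multiplier:.2f}: input ${input_cost:.2f}, output ${output_cost:.2f}")
```

This only covers the tokenizer effect. It ignores the reduction in output tokens that ArtificialAnalysis reports, which pushes in the other direction.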
What I've been doing is running a dual-model setup — use the cheaper/faster model for the heavy lifting where quality variance doesn't matter much, and only route to the expensive one when the output is customer-facing and quality is non-negotiable. Cuts costs significantly without the user noticing any difference.
The real risk is that pricing like this pushes smaller builders toward open models or Chinese labs like Qwen, which I suspect isn't what Anthropic wants long term.
There are 2 things to consider:
* Time to market.
* Building a house on someone else's land.
You're balancing the 2, hoping that you win the time to market, making the second point obsolete from a cost perspective, or you have money to pivot to DIY.
This is going to be blunt, but this business model is fundamentally unsustainable and "founders" don't get to complain their prospecting costs went up. These businesses are setting themselves up to get Sherlocked.
The only realistic exit for these kinds of businesses is to score a couple gold nuggets, sell them to the highest bidder, and leave.
A smaller builder might reconsider (re)acquiring relevant skills and applying them. We don't suddenly lose the ability to program (or hire someone to do it) just because an inference provider is available.
latest claude still fails the car wash test
>I want to wash my car. The car wash is 50 meters away. Should I walk or drive?
Walk. It's 50 meters — you're going there to clean the car anyway, so drive it over if it needs washing, but if you're just dropping it off or it's a self-service place, walking is fine for that distance.
Claude design on the other hand seemed to eat through (its own separate usage limit) very fast. Hit the limit this morning in about 45 mins on a max plan. I assume they are going to end up spinning that product off as a separate service.
To be clear, I'm not saying that it's a good thing, but it does seem to be going in this direction.
And junior devs have never added much value. The first two years of any engineer’s career are essentially an apprenticeship. There’s no value add from having a perpetually junior “employee”.
Under the hood, what was happening is that older models needed reminders, while 4.7 no longer needs them. When we showed these reminders to 4.7 it tended to over-fixate on them. The fix was to stop adding cyber reminders.
More here: https://x.com/ClaudeDevs/status/2045238786339299431
> 4.7 is quite... dumb. i think they have lobotomized this model
Is adaptive thinking still broken? Why was the option to disable it taken away?
Also there should be time distribution for the queries and a way to filter by query time. This is because Anthropic is reported to change the model quality arbitrarily in the background.
Also, there are no units in the table column headers. For example, "Request 4.7": is this the number of tokens 4.7 consumes? Is it output, input, reasoning, etc.?
Really difficult to make sense of this.
People get offended if what they are doing is labeled as slop but this is unfortunately the level of quality I expect from AI related content or code.
This has resulted in +92.9% cost and token difference. Submission bd2457e5, currently at the top of the leaderboard.
It looks like you don't allow testing of anything beyond a certain token size.
Which makes your test kind of pointless, because if you are chatting about AI with something that's only a few hundred tokens, the data you're collecting is pretty minimal and specific, not something that's generally applicable or relevant to wider use outside of that specific context.
In my opinion, we've reached some ceiling where more tokens lead only to incremental improvements. A conspiracy seems unlikely given all providers are still competing for customers and a 50% token increase drives infra costs up dramatically too.
The whole magic of (pre-nerfed) 4.6 was how it magically seemed to understand what I wanted, regardless of how perfectly I articulated it.
Now Anth says that needing to explicitly define instructions is a "feature"?!
what bugs me is the tokenizer change feels like a stealth price hike. if you're charging the same $/token but the same text now costs 35% more tokens, thats just a 35% price increase with extra steps. at least be upfront about it.