I think this is the first time we've had a third minor version bump on a frontier Anthropic model. (I count the 0.5s as major here, because they've been issued non-sequentially and also corresponded to massive capability leaps, eg, Sonnet 3.5, Opus 4.5).
So now the Opus 4.5 family has successors 4.6, 4.7, and 4.8, each posting fairly modest claimed gains. My own experience w/ 4.6 and 4.7 is that I don't firmly grasp any capability improvements over my memory of 4.5, but it's all so fuzzy that it's truly difficult to tell.
Maybe my own tastes are saturated now (it's smarter than me?) and I'll never again perceive model progress. Maybe the incrementalism is such that I'd notice immediately if my 4.7 workflows were redirected now to 4.5.
Difficult spot for the labs to be in because, if they have a stronger product, I'd prefer they release it and that I can use it.
But as this dynamic continues, the improvements are going to be less and less legible for end-users, who will complain about the churn-without-payoff, even when the payoff may actually be real.
There's orders of magnitude of low hanging juice to squeeze out of smaller models.
It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years (design not certain, probably unlikely).
It is far less clear that a 1.2T model will be meaningfully better enough to justify training it.
As far as reasoning is concerned, with the recent GRAM release, there may be 4 orders of magnitude of reasoning to tack on to smaller models.
Think about that... Google, OpenAI, Anthropic could train a 30B GRAM-based model in days - and it could potentially have better local reasoning than the best model available today at >1T params... They could upgrade that to a ~600B MoE model in days to have general trivia knowledge rivaling the best models...
You just can't train a 1T+ parameter model that fast. It is a giant "if" how much GRAM turns out to improve things, but it's unlikely to be trivial or nothing.
Larger models can already sort of tell you anything. They're never going to get everything right unless they stop being LLMs.
There's just not a lot of juice left to squeeze for Gemini to tell you exactly how tall Ke$ha is or when the last time Britney Spears went to jail was...
(G)enerative (R)ecursive re(A)soning (M)odels. They really wanted the acronym.
I agree, but with their urgent IPO-driven need to keep increasing prices, the frontier vendors now have every incentive to maintain the perception that frontier performance requires endless >$200K racks of unobtanium GPUs and RAM. While they'd love to reduce their actual costs, they'd only want to do it to the extent they are certain they can keep it secret. Otherwise, they can't maintain and keep increasing their prices. And post-IPO audited reporting makes keeping that secret even harder.
Game theory-wise they probably don't want their armies of leading researchers optimizing frontier performance, at least in any way that would further accelerate the relative price/perf of smaller models or self/cloud-hosting. While they know the open source models will always improve, they still win as long as enough customers demand the latest frontier and the open source lag remains constant.
They profit most in a world where a few frontier labs stay far in front, drag-racing each other and expending vast capital. It keeps their customers reliant and paying top dollar while keeping low-cost alternatives farther back. They probably much prefer competing with a couple other frontier labs who have similar astronomical costs and biz models, than a world where self or cloud-hosted open-source models start closing the gap enough to start commoditizing their business.
I don't disagree, but how much of this ends up being distillation? I can't help but imagine that 4.8 was probably trained in part by leveraging Mythos.
If the very large models turn out to be very expensive to run relative to the benefits, it's possible that they could end up still being trained, but ultimately used as a tool to create smaller models that are nearly as effective.
I'm curious if someone here with a stronger background in the space has a similar intuition or not.
- this gets reinvented/rediscovered constantly under different names
- it cant be trained very well (right now, will change)
- massive theoretical improvements over current models (log_2(vocabsize)=17, residual stream dim is thousands of dimensions, recursivity means more information bandwidth by ~3 OoM)
- BUT it cant be interpreted or aligned <- this is why no one uses it and no one talks about it. the idea is 100% obvious to all the frontier labs and there is a good reason why it isn't used
I follow this stuff closely, I think I know what I'm talking about (edited for formatting)
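For anyone who wants to check the "~3 OoM" figure above, here is a back-of-the-envelope sketch. Only the 17 bits per token comes from the comment; the hidden width of 4096 and the 16 bits per value are my assumptions, and the result is a raw capacity upper bound, not a measure of usable information.

```python
import math

# From the comment: log2(vocab size) = 17, i.e. a vocabulary of 2**17 tokens.
vocab_size = 2 ** 17

# Assumptions for illustration only (not from the comment):
d_model = 4096        # hypothetical residual stream width ("thousands of dimensions")
bits_per_value = 16   # hypothetical storage width, e.g. a 16-bit float per dimension

# Bits carried by one sampled token fed back into the model.
token_bits = math.log2(vocab_size)

# Raw bits in one hidden state vector fed back instead (upper bound on capacity).
state_bits = d_model * bits_per_value

ratio = state_bits / token_bits
orders_of_magnitude = math.log10(ratio)

print(f"bits per token:        {token_bits:.0f}")
print(f"raw bits per state:    {state_bits}")
print(f"ratio:                 {ratio:.0f}x")
print(f"orders of magnitude:   {orders_of_magnitude:.2f}")
```

Under those assumptions the ratio lands between 3 and 4 orders of magnitude, which is in the same ballpark as the claim. A narrower hidden width or fewer effective bits per dimension would pull it down.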
Most software engineers will just need cheap tokens.
But things like physics and drug discovery have no foreseeable upper bound.
With that said, they are now hitting the walls of energy costs and memory shortages. Your brain uses 20W -- don't take it as an insult. There are orders of magnitude to gain from producing energy-efficient models (or model runners).
So I am expecting same performance at lower costs for the coming years.
If you subscribe to things like "there are tasks LLMs are innately bad at due to insufficient depth and lack of recurrent capability", then GRAM might be another signal towards that.
But keep in mind: even ARC-AGIs have their frontiers dominated by LLMs. Even if "innately bad" is true, it clearly doesn't go all the way to "innately incapable".
Given how well Qwen3.6-27B performs for such a small model I think you could be right. I suspect that Google, OpenAI, and Anthropic must be looking at the Qwen3.6 models (as well as Deepseek V4-flash, MiMo-V2.5) and wondering if they could make some smaller models that are specifically trained for certain activities - like coding. Smaller, more targeted models would take up a lot fewer resources.
I’d be surprised tbh. Investors don’t want to hear “everyone else is still training models and seeing improvements, but we don’t want to participate in the arms race anymore.” They want monumental leaps every quarter or two because they have sunk unholy amounts of money into these companies/products.
The whole idea of “hyper scale” doesn’t jibe with caution or otherwise slowing down.
Even as humans there's so much knowledge out there that exists but it's very hard to surface unless you know exactly what you're looking for beforehand.
What insight do you have to make this claim?
I am ready to bet against this. Scores on knowledge benchmarks like SimpleQA aren't increasing for small models.
> It is far less clear that a 1.2T model will be meaningfully better enough to justify training it.
Well for one, we know for certain there is Mythos which is meaningfully better. And I think there is a lot of juice left to squeeze for Mythos class model.
There's a lot of room for improving the smaller models at many levels of the stack.
the last?!? I'm excited to see :) I'll take the other side of that since llms are so new
The benchmarks need to change. The current coding benchmarks don't capture the realities of software engineering.
I had a bunch of images that got masked by some logic, I had to evaluate something on the original images, Claude 4.7 decided to inpaint the masked images instead of just fetching the actual unmasked images from upstream.
I had another model once that decided that because it couldn't figure out how to fill out a form to log into HuggingFace to download weights for some open source model that it was going to instantiate the model with random weights and run inference on a thousand images.
Its coding was fine, but the solution was not the right one.
i think it'll be more like we get 1-10T models and then distill those down into smaller models, though
It seems like the best small models today are all distilled from bigger models
Moreover, I hypothesize Claude Opus 4.7 and now 4.8 are a distillation of Claude Mythos
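For readers unfamiliar with what "distilling" a big model into a small one means mechanically, here is a minimal sketch of the classic soft-label distillation loss (student matches the teacher's temperature-softened output distribution). This is the textbook formulation, not a claim about how any lab actually trained 4.7, 4.8, or Mythos; the logits and the temperature value are made up for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The T^2 factor is the usual convention that keeps gradient magnitudes
    comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical next-token logits over a 4-token vocabulary.
teacher = [4.0, 1.5, 0.2, -1.0]
student_far = [0.0, 0.0, 0.0, 0.0]    # untrained student: uniform guess
student_near = [3.8, 1.4, 0.3, -0.9]  # student that mostly agrees with the teacher

loss_far = distillation_loss(teacher, student_far)
loss_near = distillation_loss(teacher, student_near)
loss_same = distillation_loss(teacher, teacher)
print(loss_far, loss_near, loss_same)
```

The loss shrinks as the student's distribution approaches the teacher's and is zero when they match, which is the whole mechanism: the small model is trained to imitate the big model's outputs rather than only the raw data.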
A) I reckon it's true that smaller models will continue to improve massively through optimization and better and better harnesses, this tech is all still very young and A LOT of resources and (good-)will is being thrown at it.
B) The 1T+ models will be able to sideload and improve upon a lot of the fundamental improvements that happen to the smaller models to speed up incredibly while getting better at tools while (on a gradient) getting -more- things right.
C) More of an observation that I think is worth keeping in mind clearly; Karl Popper's black swan and all, truth in our temporal world IS a gradient!
Most software engineers will just need cheap tokens.
But things like physics and drug discovery have no foreseeable upper bound.
But there is a ton of juice left to squeeze when it comes to post-training/RL for a ton of useful things in practice, right? It’s been amazing seeing how good modern model tool use is, for example, and I bet there is a lot of room for improvement still (no doubt a ton of improvement can be made more easily on the agent harness front or via post-training regimes like LoRA, which does lend support to your point about diminishing pre-training juice).
Where do I find papers like this? Outside of hacker news comments. It's so hard to find the good stuff in all the noise IMO.
I can see a LOT of room to explore and partition domains into more specified models still.
I couldn’t even imagine having to go back to a model from 12 months ago, much less 24 months ago. GPT-5.5 is so much better than GPT-4o that it sure seems like they keep finding new juice to squeeze.
This is like going from dialup internet to DSL and acting like it has peaked before gigabit cable and fiber come along. We are at the beginning of hardware truly made for AI.
Boomer comparison, but I remember the 8 bit computer era when the hardware was what it was so the later games of that era used hardware better than previous ones.
We have so many ways of optimizing:
- continuously creating more and better training data
- increasing parameters to 20/50/100TB
- We still wait for Mythos access
- We are still waiting for a Mythos distillation (I haven't heard any rumors that a distilled version of Mythos is out)
- Reinforcement learning and evolutionary algorithms have only just started to appear
- If a small 30GB Model can do stuff, these models can also be used as teachers for the big ones
- We have not seen yet specialized models at all. Like a coding java german expert model. Why? Even with MoE architecture, you still need to have these layers around
- Research for Diffusion and other models is still in progress
- Nvidia just announced/showed a 7x speedup on inferencing for Nemotron
- Multitoken prediction became available just a few weeks ago
- Compute is only now getting into a range where they can run a lot more, and cheaper, experiments (see Google IO 2026 announcement)
- World models are showing great progress and we do not know yet what they will bring to the table
- They are probably not finetuning/fixing all areas in parallel. I would argue that Anthropic focuses most of its efforts into coding and agentic. Google for sure does subagent and agentic optimizations too. Plenty of areas are just not touched i would say because they don't have the capacity
- We see more and more multimodal models (these also consume compute)
- The N-Gram paper and similar work: I have not yet seen all of these ideas in Chinese open models
- We don't even know yet what Meta is doing, but we do know they restarted their efforts again
- Anthropic's models got a lot better, benchmark-wise, at denying nonsense asks. They are learning how to get rid of or reduce hallucinations
- We are in the middle of the biggest reinforcement loop, with all the training data we give them day to day, and it's not clear at all whether they already use these models in their training and at what stage.
- We do expect bigger models to be able to comprehend deeper concepts / broader code bases. Big companies with huge code bases probably are waiting for this
- There will also be continuous progress in harnesses, which in itself is not part of LLM progress (fair), but these harnesses do get better when you finetune a model to be optimized for a harness
- ChatGPT's Image model 2.0 got noticeably better and came out just a month ago
I suspect, based on hardware requirements and progress on hardware infrastructure alone, that the industry wants to go to 100T models and we do not know yet what this will mean. I could see us skipping the normal transformer and finding other relevant architectures.
Just a week ago there was a research paper about parallel input and output streams which has not been explored enough.
There was also a research paper where they showed that an LLM can compute things. It will take time to see where this leads.
I don't think the focus on GRAM and facts is so relevant. It's about context and context handling, not just some facts.
Graphic RAM?
My conspiracy theory is that Apple recognizes this.
My 2¢, I personally feel like all of the productivity gains since 4.5's release (in November 2025!!) have come from improvements to the harnesses (cc, cursor cli, codex, opencode, whatever) AND from the context window expansion from 200k to 1M.
But the actual "raw" intelligence of the model / ability to make good decisions feels like it has plateaued since 4.5. 4.6 was maybe a small improvement, but hard to differentiate from in-context-learning with the 1M window. 4.7 if anything felt like a regression in wisdom for me and my coworkers, with it consistently making worse/lazier decisions.
I was one of those people for whom Claude would never finish anything and would just randomly say, "this is a good stopping point, I think you should go to bed."
And then I'd tell it to continue, and it would burn tons of tokens, make no progress and say, "This is a really good stopping point..."
Canceled and switched to Codex and have been pretty happy with it. It doesn't plan as well as Claude, but I think it does better implementation - and neither of them can actually come up with good plans without a ton of help...
Codex is also way faster.
There's a sweet spot of complexity for low importance tasks where it's just big enough I don't want to do it and just simple enough to have opus plan/delegate/review with another model. So possibly model improvements will grow this window, but currently I don't do much in there.
I have limited enterprise budget and Claude 4.7 costs 7x more. So unless there's close to 7x improvement, it doesn't make sense to switch to 4.7.
I actually gave 4.6 a really complex task. It kept on thinking for several minutes before I hit the brakes. I then gave 4.7 the same task, and didn't notice any difference in behavior. Clearly not worth the 7x premium.
I hope 4.6 becomes cheaper/free at some point because I'm starting to see a push towards optimizing token expenditures across the board. While frontier models are still the default for developing new workflows, everybody is starting to ask how to automate repetitive tasks without using tokens.
Honestly... not that dramatically. Each release is much more marginal. And the quoted official benchmarks don't translate very well into the real world.
4.7 regressed hard in some ways. But a compounding factor too is that the claude code harness seems to nerf the model after a few months. Probably to reduce token use.
So far 4.8 seems less verbose but we'll see in practice what it translates into meaningfully.
It still seems trying to build general models is mostly cost prohibitive - the frontier model provider and resellers are repricing in such a way the return on investment is dropping as developers and users become more cautious of burning their limits.
I'm still of the opinion that models like 4.6 don't need to be improved on - rather they need to be better integrated with more domain specific models in agentic flows.
https://platform.claude.com/docs/en/about-claude/pricing
```
Model | Base Input Tokens | 5m Cache Writes | 1h Cache Writes | Cache Hits & Refreshes | Output Tokens
Claude Opus 4.8 | $5 / MTok | $6.25 / MTok | $10 / MTok | $0.50 / MTok | $25 / MTok
Claude Opus 4.7 | $5 / MTok | $6.25 / MTok | $10 / MTok | $0.50 / MTok | $25 / MTok
Claude Opus 4.6 | $5 / MTok | $6.25 / MTok | $10 / MTok | $0.50 / MTok | $25 / MTok
Claude Opus 4.5 | $5 / MTok | $6.25 / MTok | $10 / MTok | $0.50 / MTok | $25 / MTok
Claude Opus 4.1 | $15 / MTok | $18.75 / MTok | $30 / MTok | $1.50 / MTok | $75 / MTok
Claude Opus 4 (deprecated) | $15 / MTok | $18.75 / MTok | $30 / MTok | $1.50 / MTok | $75 / MTok
Claude Sonnet 4.6 | $3 / MTok | $3.75 / MTok | $6 / MTok | $0.30 / MTok | $15 / MTok
Claude Sonnet 4.5 | $3 / MTok | $3.75 / MTok | $6 / MTok | $0.30 / MTok | $15 / MTok
Claude Sonnet 4 (deprecated) | $3 / MTok | $3.75 / MTok | $6 / MTok | $0.30 / MTok | $15 / MTok
Claude Haiku 4.5 | $1 / MTok | $1.25 / MTok | $2 / MTok | $0.10 / MTok | $5 / MTok
Claude Haiku 3.5 (retired, except on Bedrock and Vertex AI) | $0.80 / MTok | $1 / MTok | $1.60 / MTok | $0.08 / MTok | $4 / MTok
```
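To make the table concrete, here is a small sketch that turns those per-MTok prices into a per-request cost. The prices are copied from the table above; the function name, the token counts, and the simplification that cached tokens are billed only at the cache-hit rate (ignoring cache-write charges) are my own assumptions.

```python
# (base input, cache hit, output) prices in USD per million tokens, from the table.
PRICES_PER_MTOK = {
    "Claude Opus 4.8": (5.00, 0.50, 25.00),
    "Claude Opus 4.5": (5.00, 0.50, 25.00),
    "Claude Opus 4.1": (15.00, 1.50, 75.00),
    "Claude Sonnet 4.6": (3.00, 0.30, 15.00),
    "Claude Haiku 4.5": (1.00, 0.10, 5.00),
}

def request_cost(model, input_tokens, output_tokens, cached_tokens=0):
    """Estimate the USD cost of one request.

    cached_tokens is the part of the input served from the prompt cache;
    cache-write charges are ignored in this simplified sketch.
    """
    base, cache_hit, output = PRICES_PER_MTOK[model]
    uncached = input_tokens - cached_tokens
    total = uncached * base + cached_tokens * cache_hit + output_tokens * output
    return total / 1_000_000

# Hypothetical agentic request: 100k tokens in, 10k tokens out.
for model in PRICES_PER_MTOK:
    print(f"{model:18s} ${request_cost(model, 100_000, 10_000):.2f}")

# Same request on Opus 4.8 with 80k of the input served from cache.
print(f"with cache hits    ${request_cost('Claude Opus 4.8', 100_000, 10_000, 80_000):.2f}")
```

Going by the listed prices, Opus 4.5 through 4.8 cost the same per token, so any difference in a bill between them would have to come from how many tokens each one spends on the same task.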
Are the dividing lines around personality? Working domains? Opinionated software stuff?
Who knows?
I've actually intentionally switched back to 4.5. I hated 4.7 so much that I decided to jump back all the way to 4.5.
Now that I've been using 4.5 for a few weeks, I find it significantly more reliable but a bit more forgetful than 4.6/4.7. I'm okay with that because it's really easy to identify this forgetfulness and nudge it.
I found 4.7's adaptive thinking to be extremely unreliable. It seems to overcorrect on the current message without considering the difficulty of the overall problem. I wonder if 4.8 will improve on that.
I also recently moved to 4.6 since I started hitting the context limit too often with my current project.
This one change will probably solve 80% of the problems you have noticed.
Data at https://gertlabs.com/rankings
In my experience, Opus 4.0 was fantastic, a major jump from 3.7. It was creative, super slow and expensive, and would sometimes forget what it was doing, but it was getting the job done.
4.1 they made it much faster, so a lot of infra improvements.
4.5 was when it could work on longer tasks and stopped making a lot of the obvious mistakes of 4.0. I think this was about the time Opus went mainstream and Anthropic's compute crisis began, so instead of making the model better they tried to optimize it to reduce cost.
4.6 was such a bad model, they switched to adaptive thinking and it had so many bugs. poor api design, benchmaxxed and poor real-world results. i switched back to 4.5.
4.7 they just fixed the bugs they added in 4.6. Better than 4.5.
haven't fully tested 4.8 yet.
It's just amusing reading all these posts with different viewpoints, just in this thread there are multiple people saying 4.6 was so much better than 4.7 and that they switched back to 4.6.
You won't, really.
It also seems to be helpless at effort levels < xhigh, I turn to Sonnet when simpler tasks are needed.
It might be saturated for smaller scopes of work, but it’s not hard to see the cracks when you scale up what you ask of SOTA models/agents.
One example, to try and single shot prompt coding a ChatGPT equivalent chatbot.
Sure it will spit something out, but the feature depth, UX subtleties, backend integration, and lots of pragmatic engineering decisions along the way will just not be baked.
Another example is building a C compiler from scratch which Anthropic showed is still a struggle to do.
Not that these specific examples are important, but just to point out that scaling up expectations shows the cracks.
It’s not just a model problem of course, better agents, orchestration features (like Dynamic Workflows mentioned in the post), all need to continue to evolve.
At what point my CS degree becomes totally useless is an open question.
Why are you people saying all these things.
We'll probably see long-distance space travel long before a degree in generic problem identification and solving becomes totally useless.
A few days? A few weeks? Longer?
However a company releases a new AI model and within hours users are confidently proclaiming how much smarter it is than previous versions.
A lot of the information (blogs, tweelches, plosts) that I consume seems to be converging on the idea that we all depend on the models. However. It seems to me that the exact opposite is true. The models depend on us, and _desperately_ so.
There must have been stories, books, movies, made about this intellectual (and propositional, legal, factual) inversion.
The majority need the minority. Has always been the case, I now think. But what has newly developed is that the majority can take a dependency not on the minority, but on a select few companies who are abstracting and compressing the minority into latent spaces.
How do I know? Because when pushing both to generate code or in independent chats to analyze projects, 5.5 will consistently find all the bugs that Claude does not find, and when challenged, Claude does agree those bugs were there. And my findings match those.
When, from a blank start, I ask Claude to analyze project A and project B, Claude will consistently say project B is the better structured, more robust, and more defect-free one, and does justify it. And project B was the one created by GPT 5.5... and also the one I judge to be the best one.
And yes, both at deep effort settings and starting from same specs...
Greetings to the Anthropic office good sirs btw.
EX. You call an orchestration agent and define an implementation plan with the help of a number of sub agents planning out different features. You and the lead agent review all of the plans and send them off to a set of agents that write tests, which get sent back to the orchestrator and then passed along with the plan to a set of coding agents who implement the features in their own worktrees. That gets passed back to the orchestrator, which hands it off to another set of agents doing the code review and merging the features before sending it back to you.
I think there's a big push to get these companies in a state where they can be dumped on public markets.
I genuinely hope that you're joking with that statement.
Or this is a bot.
Or an ARG.
Or Art.
Help.
I feel like I get to know a model in the human sense of understanding a personality. Yesterday I knew 4.6 extended, today it's different, there's multiple "token budget" levels. I just want 4.6 extended back as it was, I was getting on well with it / them.
They mention more granular control of effort, 'dynamic workflows' and more speed controls ("fast mode"). While they position them as user features, they also sound like the kinds of knobs Anthropic will need to twiddle on the back-end to balance costs, margins, ARR, and user growth vs retention post-IPO to hit key metrics in quarterly reporting.
I'm hoping they recreate the magic of 4.5 but it's as much about the quality of harness, the memory and efficiency of the tools than simply the models at this point.
Now that they have Colossus capacity, I guess they can tune up the intelligence again and spend more tokens on reasoning budgets.
4.7 was definitely a lot more flaky for me vs. 4.6 before the reasoning bugs.
I have ONLY heard negative feedback about it, and trying it myself also yielded really awful results.
It's kind of like how the consumer laptop market is now. I was telling my boss today that most employees wouldn't see any noticeable performance difference between a macbook pro and a neo if they are just doing admin stuff on the web.
Also, the biggest factor is having a good planning phase. A good plan is better than even major model improvements.
You don't have to correct it dozens of times a day!? Really?
If the hype train keeps going for another year, Sam and co will have to resort to direct gaslighting like saying the model is improving but nobody can feel it anymore, oh and I need 10 trillion dollars
i still haven't really noticed it per se being better
This felt particularly visible during the 4.6 period, when people said that 4.6 felt dumber, and I remember someone doing some analysis that sort of proved that models were getting dumber over time.
This has the benefit of costing the company less to run while users pay for a standard subscription, but also, at the same time, of making the next model "feel" better by comparison when it drops to the public.
Again, I am not sure if this is the case or not; I am merely proposing something that I feel might be within the realm of possibility.