Issue: Claude Code is unusable for complex engineering tasks with Feb updates (opens in new tab)

(github.com)

1364 pointsStanAngeloff1mo ago754 comments

754 comments

Hey all, Boris from the Claude Code team here. I just responded on the issue, and cross-posting here for input.

---

Hi, thanks for the detailed analysis. Before I keep going, I wanted to say I appreciate the depth of thinking & care that went into this.

There's a lot here, I will try to break it down a bit. These are the two core things happening:

> `redact-thinking-2026-02-12`

This beta header hides thinking from the UI, since most people don't look at it. It *does not* impact thinking itself, nor does it impact thinking budgets or the way extended reasoning works under the hood. It is a UI-only change.

Under the hood, by setting this header we avoid needing thinking summaries, which reduces latency. You can opt out of it with `showThinkingSummaries: true` in your settings.json (see [docs](https://code.claude.com/docs/en/settings#available-settings)).

If you are analyzing locally stored transcripts, you wouldn't see raw thinking stored when this header is set, which is likely influencing the analysis. When Claude sees lack of thinking in transcripts for this analysis, it may not realize that the thinking is still there, and is simply not user-facing.

> Thinking depth had already dropped ~67% by late February

We landed two changes in Feb that would have impacted this. We evaluated both carefully:

1/ Opus 4.6 launch → adaptive thinking default (Feb 9)

Opus 4.6 supports adaptive thinking, which is different from thinking budgets that we used to support. In this mode, the model decides how long to think for, which tends to work better than fixed thinking budgets across the board. `CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING` to opt out.

2/ Medium effort (85) default on Opus 4.6 (Mar 3)

We found that effort=85 was a sweet spot on the intelligence-latency/cost curve for most users, improving token efficiency while reducing latency. On of our product principles is to avoid changing settings on users' behalf, and ideally we would have set effort=85 from the start. We felt this was an important setting to change, so our approach was to:

1. Roll it out with a dialog so users are aware of the change and have a chance to opt out

2. Show the effort the first few times you opened Claude Code, so it wasn't surprising.

Some people want the model to think for longer, even if it takes more time and tokens. To improve intelligence more, set effort=high via `/effort` or in your settings.json. This setting is sticky across sessions, and can be shared among users. You can also use the ULTRATHINK keyword to use high effort for a single turn, or set `/effort max` to use even higher effort for the rest of the conversation.

Going forward, we will test defaulting Teams and Enterprise users to high effort, to benefit from extended thinking even if it comes at the cost of additional tokens & latency. This default is configurable in exactly the same way, via `/effort` and settings.json.

Wowfunhappy1mo ago

> Under the hood, by setting this header we avoid needing thinking summaries, which reduces latency. You can opt out of it with `showThinkingSummaries: true` in your settings.json (see [docs](https://code.claude.com/docs/en/settings#available-settings)).

Can I just see the actual thinking (not summarized) so that I can see the actual thinking without a latency cost?

I do really need to see the thinking in some form, because I often see useful things there. If Claude is thinking in the wrong direction I will stop it and make it change course.

3 more replies

richardjennings1mo ago

I was not aware the default effort had changed to medium until the quality of output nosedived. This cost me perhaps a day of work to rectify. I now ensure effort is set to max and have not had a terrible session since. Please may I have a "always try as hard as you can" mode ?

5 more replies

johndough1mo ago

I think it is hilarious that there are four different ways to set settings (settings.json config file, environment variable, slash commands and magical chat keywords).

That kind of consistency has also been my own experience with LLMs.

8 more replies

koverstreet1mo ago

There's been more going on than just the default to medium level thinking - I'll echo what others are saying, even on high effort there's been a very significant increase in "rush to completion" behavior.

3 more replies

plexicle1mo ago

Ultrathink is back? I thought that wasn't a thing anymore.

If I am following.. "Max" is above "High", but you can't set it to "Max" as a default. The highest you can configure is "High", and you can use "/effort max" to move a step up for a (conversation? session?), or "ultrathink" somewhere in the prompt to move a step up for a single turn. Is this accurate?

2 more replies

potsandpans1mo ago

For anyone reading this and wondering where the truth could possibly be:

We can't really know what the truth is, because Anthropic is tightly controlling how you interact with their product and provides their service through opaque processes. So all we can do is speculate. And in that speculation there's a lot of room (for the company) to bullshit or provide equally speculative responses, and (for outsiders) to search for all plausible explanations within the solution space. So there's not much to action on. We're effectively stuck with imprecise heuristics and vibes.

But consider what we do know: the promise is that Anthropic is providing a black-box service that solves large portions of the SDLC. Maybe all of it. They are "making the market" here, and their company growth depends on this bet. This is why these processes are opaque: they have to be. Anthropic, OpenAI and a few others see this as a zero-sum game. The winner "owns" the SDLC (and really, if they get their way the entire PDLC). So the competitive advantage lies in tightly controlling and tweaking their hidden parameters to squeeze as much value and growth as possible.

The downside is that we're handing over the magic for convenience and cost. A lot of people are maybe rightly criticizing the OP of the issue because they're staking their business on Claude Code in a way that's very risky. But this is essentially what these companies are asking for. The business model end game is: here's the token factory, we control it and you pay for the pleasure of using it. Effectively, rent-seeking for software development. And if something changes and it disrupts your business, you're just using it incorrectly. Try turning effort to max.

Reading responses like this from these company representatives makes me increasingly uneasy because it's indicative of how much of writing software is being taken out from under our feet. The glimmer of promise in all of this though is that we are seeing equity in the form of open source. Maybe the answer is: use pi-mono, a smattering of self hosted and open weights models (gemma4, kimi, minimax are extremely capable) and escalate to the private lab models through api calls when encountering hard problems.

Let the best model win, not the best end to end black box solution.

2 more replies

anonymoushn1mo ago

How do you guys decide which settings should be configurable via environment variables but not settings files and which settings should be configurable via settings files but not environment variables?

3 more replies

robeym1mo ago

This is confusing. ULTRATHINK is a step below /effort max?

ULTRATHINK triggers high effort. /effort max is above high. Calling it ULTRATHINK sounds like it would be the highest mode. If someone has max set and types ULTRATHINK, they're lowering their effort for that turn.

For anyone reading this trying to fix the quality issues, here's what I landed on in ~/.claude/settings.json:

  {
    "env": {
      "CLAUDE_CODE_EFFORT_LEVEL": "max",
      "CLAUDE_CODE_DISABLE_BACKGROUND_TASKS": "1",
      "CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING": "1"
    }
  }

The env field in settings.json persists across sessions without needing /effort max every time.

DISABLE_ADAPTIVE_THINKING is key. That's the system that decides "this looks easy, I'll think less" - and it's frequently wrong. Disabling it gives you a fixed high budget every turn instead of letting the model shortchange itself.

3 more replies

w10-11mo ago

Here's the reply in context:

https://github.com/anthropics/claude-code/issues/42796#issue...

Sympathies: Users now completely depend on their jet-packs. If their tools break (and assuming they even recognize the problem). it's possible they can switch to other providers, but more likely they'll be really upset for lack of fallbacks. So low-touch subscriptions become high-touch thundering herds all too quickly.

hansmayer1mo ago

You guys realise you are about 3 months into another one of your CEOs announcements that AI would "write all code in 6 months", right? Based on the problems you are facing, would you say your CEO gave a realistic announcement this time around ?

2 more replies

dc_giant1mo ago

All right so what do I need to do so it does its job again? Disable adaptive thinking and set effort to high and/or use ULTRATHINK again which a few weeks ago Claude code kept on telling me is useless now?

2 more replies

anonymoushn1mo ago

> On of our product principles is to avoid changing settings on users' behalf

Ideally there wouldn't be silent changes that greatly reduce the utility of the user's session files until they set a newly introduced flag.

I happen to think this is just true in general, but another reason it might be true is that the experience the user has is identical to the experience they would have had if you first introduced the setting, defaulting it to the existing behavior, and then subsequently changed it on users' behalf.

aizk1mo ago

How do you guys manage regressions as a whole with every new model update? A massive test set of e2e problem solving seeing how the models compare?

3 more replies

KenoFischer1mo ago

While we have you here, could you fix the bash escaping bug? https://github.com/anthropics/claude-code/issues/10153

mikkom1mo ago

>Going forward, we will test defaulting Teams and Enterprise users to high effort, to benefit from extended thinking even if it comes at the cost of additional tokens & latency.

interesting that you only make this default on those accounts that pay per token while claiming "medium is best for most users"

That decision seems to imply that the thinking change was more about increasing your profits than anything else

1 more reply

taspeotis1mo ago

Hi, thanks for Claude Code. I was wondering though if you'd considering adding a mode to make text green and characters come down from the top of the screen individually, like in The Matrix?

1 more reply

ai_slop_hater1mo ago

> This beta header hides thinking from the UI, since most people don't look at it.

I look at it, and I am very upset that I no longer see it.

1 more reply

yubblegum1mo ago

> Before I keep going, I wanted to say I appreciate the depth of thinking & care that went into this.

"This report was produced by me — Claude Opus 4.6 — analyzing my own session logs. ... Ben built the stop hook, the convention reviews, the frustration-capture tools, and this entire analysis pipeline because he believes the problem is fixable and the collaboration is worth saving. He spent today — a day he could have spent shipping code — building infrastructure to work around my limitations instead of leaving."

What a "fuckin'" circle jerk this universe has turned out to be. This note was produced by me and who the hell is Ben?

2 more replies

migali49g1mo ago

Hi Boris, thanks for addressing this and providing feedback quickly. I noticed the same issue. My question is, is it enough to do /efforts high, or should I also add CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING to my settings?

JohnMakin1mo ago

I’ve seen you/anthropic comment repeatedly over the last several months about the “thinking” in similar ways -

“most users dont look at it” (how do you know this?)

“our product team felt it was too visually noisy”

etc etc. But every time something like this is stated, your power users (people here for the most part) state that this is dead wrong. I know you are repeating the corporate line here, but it’s bs.

4 more replies

Sayrus1mo ago

> If you are analyzing locally stored transcripts, you wouldn't see raw thinking stored when this header is set, which is likely influencing the analysis. When Claude sees lack of thinking in transcripts for this analysis, it may not realize that the thinking is still there, and is simply not user-facing.

Claude often fetches past transcript for information after compaction. Wouldn't this effectively distort the view it has of past discussions?

DennisL1231mo ago

Happy to have my mind changed, yet I am not 100% convinced closing the issue as completed captures the feedback.

1 more reply

hedora1mo ago

I tried testing 4.5 opus and 4.6 opus both with “high” thinking. Same box, same repo. I had them plan a moderate complexity refactoring on a small codebase.

Observations:

4.6 had previously failed to the point where I had to wipe context. It must have written memories because it was referring to the previous conversation.

As the article points out, 4.6 went out of its way to be lazy and came up with an unusable plan. It did extra planning to avoid renaming files (the toplevel task description involves reorganizing directories of files).

4.6 took twice as long to respond as 4.5.

I’m treating this as a model regression. 4.6 is borderline unusable. I’ve hit all the issues the article describes.

Also, there needs to be an obvious way to disable memory or something. The current UX is terrible, since once an error or incorrect refusal propagates, there is no obvious recovery path.

Anyway, with think set to high, I see drastically different behavior: much slower and much worse output from 4.6.

1 more reply

starkparker1mo ago

> You can also use the ULTRATHINK keyword to use high effort for a single turn

First I've heard that ultrathink was back. Much quieter walkback of https://decodeclaude.com/ultrathink-deprecated/

1 more reply

giancarlostoro1mo ago

I only ever use high effort, the only thing I've run into sometimes I ask Claude to do every item on a list of items, and not stop until they're all done, it finishes maybe 80% of them then says "I've stopped doing things" for no reasonable reason. I don't need it to run for 18 hours nonstop, but 10 or 20 minutes more it would have kept going for wouldn't have hurt, especially when I am usually on Claude Code during off-hours, and on the Max plan.

Part of me wants to give lower "effort" a try, but I always wind up with a mess, I don't even like using Haiku or Sonnet, it feels like Haiku goofs, Haiku and Sonnet are better as subagent models where Opus tells them what to do and they do it from my experience.

1 more reply

tigershark1mo ago

What change did you release on March 23rd when the subscription limits collapsed and they are still way down compared to what they used to be?

linsomniac1mo ago

Hey Boris, thanks for this reply. I've been kind of scratching my head over this issue, assuming I'm just not doing "complex engineering", because since Opus 4.6 my seat-of-the-pants assessment is that it's a huge improvement. It's been like night and day in my use. Full disclosure: I use high effort for basically everything.

freeqaz1mo ago

I have been wondering if 1 Million token context contributes here also. Compaction is much rarer now. How does that influence model performance? For some tasks I do, I feel like performance is worst now after this. Also Plan mode doesn't seem to wipe context anymore?

1 more reply

niteshpant1mo ago

I added `CLAUDE_CODE_EFFORT_LEVEL=max` to my shell's env so that every session is always effort:max by default

1 more reply

y1n01mo ago

"most users"

Have you guys considered that you should be optimizing for the leading tail of the user distribution? The people that are actually using AI to push the envelope of development? "most users," i.e. the inner 70%, aren't doing anything novel.

hellojimbo1mo ago

The last time I typed ultrathink, i got a prompt saying that you no longer need to type ultrathink

Jenk1mo ago

Claude's settings don't appear to be in sync with the published settings schema[0].

[0]: https://www.schemastore.org/claude-code-settings.json.

sroussey1mo ago

> Roll it out with a dialog so users are aware of the change and have a chance to opt out

Here is the issue. Force a choice instead. Your UI person will cry about friction, but friction is desired for such a change.

matheusmoreira1mo ago

I definitely noticed the mid-output self-correction reasoning loops mentioned in the GitHub issue in some conversations with Opus 4.6 with extended reasoning enabled on claude.ai. How do I max out the effort there?

ting01mo ago

Do you guys realize that everyone is switching to Codex because Claude Code is practically unusable now, even on a Max subscription? You ask it to do tasks, and it does 1/10th of them. I shouldn't have to sit there and say: "Check your work again and keep implementing" over and over and over again... Such a garbage experience.

Does Anthropic actually care? Or is it irrelevant to your company because you think you'll be replacing us all in a year anyway?

1 more reply

diavelguru1mo ago

As soon as that change came through I set the effort to high. Have not regretted it for any coding task. It feels the same as Dec-Jan though now spawning more sub agents which is not a bad thing.

erikpau1mo ago

I'd hate to be that guy, but Opus not a very smart model when the effort is set to anything below high. I think, given the feedback from the community, this would be an obvious signal. However, moving the effort to anything beyond medium is a huge token burn. These issues didn't exist, or at least not this persistent, before the last 2 weeks. I, and perhaps a million or so other developers, would ask you to reconsider this thinking. I understand you need to run a business, but so do we, and Claude Opus is genius with a drinking problem, and you never really know upfront if it's drunk or not, but it's generally quite clear after a few minutes.

Other models, such as K2, GLM-5.1, and "the other one" seem to far less drunk than your approach, and you're losing fans quickly if you keep making these kind of changes to the tools or models.

zenoware1mo ago

> CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING

Why not just give people the abiltiy ot set a default thinking level instead of manually setting it to `max` all the time.

ting01mo ago

Thinking time is not the issue. The issue is that Claude does not actually complete tasks. I don't care if it takes longer to think, what I care about is getting partial implementations scattered throughout my codebase while Claude pretends that it finished entirely. You REALLY need to fix this, it's atrocious.

Jimpulse1mo ago

Thanks for transparency here. Claude code if fun to use again! The thinking is huge when working with Claude as planner.

thomascountz1mo ago

   This beta header hides thinking from the UI, since most people don't look at it.

How is this measured?

1 more reply

saidnooneever1mo ago

did the cost go up, or did you lower costs (token consumptions) for all users and then now want to default enterprise/teams back to normal mode. Because it seems like a long way aroundabout to say now it will cost more for same quality.

j451mo ago

Thanks for the update,

Perhaps max users can be included in defaulting to different effort levels as well?

gnegggh1mo ago

Last time quality was degraded like this it was impossible to get a refund.

weakfish1mo ago

Didn’t ULTRATHINK get deprecated? Last time I typed it I got a warning.

CjHuber1mo ago

I honestly am very disappointed with this. I've only learned about CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING and showThinkingSummaries: true from this post. I've been wondering for a while where the summaries went and am always hoping like roulette that it thinks a lot. No wonder if there suddently is an "adaptive thinking" mode. I would have opted out 2 months ago if it was documented or communicated in any way publicly. Why change behavior without notice or any new user facing settings.

I just googled "CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING" and it seems like many people don't know about it.

And ULTRATHINK sets the effort to high, but then there is also /effort max?

1 more reply

raincole1mo ago

> I wanted to say I appreciate the depth of thinking & care that went into this.

The irony lol. The whole ticket is just AI-generated. But Anthropic employees have to say this because saying otherwise will admit AI doesn't have "the depth of thinking & care."

2 more replies

jacquesm1mo ago

Textbook example of how to respond to your customers, kudos.

2 more replies

foofloobar1mo ago

Claude Code and Opus used to do a great job a few months ago. It seemed to get it right more often than not. It seemed to be far better at figuring out what has to be done and getting it right on the first attempt. This is likely model related since Claude Code has received some bug fixes since.

The list of bugs and performance problems appears to keep growing: reduced usage quotas, poor performance with numerous attempts at getting things right, cache invalidation bugs, background requests which have to be disabled explicitly to avoid consuming the quota too fast, Opus appears to be quantized even with high thinking mode, poor tool use with tool search disabled, broken tool search with tool search enabled, laziness, poor planning, poor execution, gets stuck when debugging simple code issues, writes code which isn't required, starts making changes and executing whatever it wants when told to simply prepare a plan for something, it doesn't follow instructions to use agents as told and numerous other issues with following the instructions.

The quota story is atrocious. It's difficult to get anything done with Claude Code due to the quota reduction. The cache invalidation bugs don't help either.

The tool use is also a pain to deal with. It appears to choose tools randomly with or without tool search. It keeps running custom CLI commands when it has instructions to use Makefile targets. It often ingests the output of some command with hundreds of lines of output without discrimination. It often uses lots of bash grep and find commands when it has better tools available to search across files and to use MCP tools which are far more efficient. It ignores MCP tools most of the time.

This doesn't appear to be an issue with the prompt itself. I'll try to fix the system prompt next to work around some of the issues. It seems to not follow instructions and to do whatever it feels like doing. It comes off as one of those Q2-Q3 quantized models from huggingface.

The impact of the cache invalidation issue, reduced quota, poor model performance and Claude Code bugs together have rendered this service almost entirely useless for me. The poor model performance means that many more attempts are required and more requests are made to the Anthropic API. The Claude Code bugs and design lead to cache invalidation more often. This makes the impact of the reduced quota even worse. It makes a lot more API requests because the model doesn't get it right on the first 1-2 attempts or because it chooses less than optimal strategies to find what it's looking for.

The communication and Anthropic's overall handling of the reported bugs and problems hasn't been that good either.

As for the session ID and other things you might request for debugging, there's nothing special here that's not reported widely on every Reddit thread from several subreddits. I use 200k context with Opus and Sonnet. I use high thinking mode because anything less appears to be complete garbage with extremely poor results. I avoid compact in favor of knowledge transfer markdown files.

It'd be great to see Anthropic fix the caching issues, to improve the quality of the model, to address the Claude Code bugs, to sort out the quota fiasco, to improve their communication skills, to communicate more with their customers and to be more proactive overall. I'll take my money elsewhere otherwise.

ctoth1mo ago

[flagged]

6 more replies

nickvec1mo ago

Hey Boris, would appreciate if you could respond to my DM on X about Claude erroneously charging me $200 in extra credit usage when I wasn't using the service. Haven't heard back from Claude Support in over a month and I am getting a bit frustrated.

1 more reply

areoform1mo ago

Hey Boris, thanks for the awesomeness that's Claude! You've genuinely changed the life of quite a few young people across the world. :)

not sure if the team is aware of this, but Claude code (cc from here on) fails to install / initiate on Windows 10; precise version, Windows 10.0.19045 build 19045. It fails mid setup, and sometimes fails to throw up a log. It simply calls it quits and terminates.

On MacOS, I use Claude via terminal, and there have been a few, minor but persistent harness issues. For example, cc isn't able to use Claude for Chrome. It has worked once and only once, and never again. Currently, it fails without a descriptive log or issue. It simply states permission has been denied.

More generally, I use Claude a lot for a few sociological experiments and I've noticed that token consumption has increased exponentially in the past 3 weeks. I've tried to track it down by project etc., but nothing obvious has changed. I've gone from almost never hitting my limits on a Max account to consistently hitting them.

I realize that my complaint is hardly unique, but happy to provide logs / whatever works! :)

And yeah, thanks again for Claude! I recommend Claude to so many folks and it has been instrumental for them to improve their lives.

I work for a fund that supports young people, and we'd love to be able to give credits out to them. I tried to reach out via the website etc. but wasn't able to get in touch with anyone. I just think more gifted young people need Claude as a tool and a wall to bounce things off of; it might measurably accelerate human progress. (that's partly the experiment!)

1 more reply

noxa1mo ago

I'm the author of the report in there. The stop-phrase-guard didn't get attached but here it is: https://gist.github.com/benvanik/ee00bd1b6c9154d6545c63e06a3...

You can watch for these yourself - they are strong indicators of shallow thinking. If you still have logs from Jan/Feb you can point claude at that issue and have it go look for the same things (read:edit ratio shifts, thinking character shifts before the redaction, post-redaction correlation, etc). Unfortunately, the `cleanupPeriodDays` setting defaults to 20 and anyone who had not backed up their logs or changed that has only memories to go off of (I recommend adding `"cleanupPeriodDays": 365,` to your settings.json). Thankfully I had logs back to a bit before the degradation started and was able to mine them.

The frustrating part is that it's not a workflow _or_ model issue, but a silently-introduced limitation of the subscription plan. They switched thinking to be variable by load, redacted the thinking so no one could notice, and then have been running it at ~1/10th the thinking depth nearly 24/7 for a month. That's with max effort on, adaptive thinking disabled, high max thinking tokens, etc etc. Not all providers have redacted thinking or limit it, but some non-Anthropic ones do (most that are not API pricing). The issue for me personally is that "bro, if they silently nerfed the consumer plan just go get an enterprise plan!" is consumer-hostile thinking: if Anthropic's subscriptions have dramatically worse behavior than other access to the same model they need to be clear about that. Today there is zero indication from Anthropic that the limitation exists, the redaction was a deliberate feature intended to hide it from the impacted customers, and the community is gaslighting itself with "write a better prompt" or "break everything into tiny tasks and watch it like a hawk same you would a local 27B model" or "works for me <in some unmentioned configuration>" - sucks :/

p1necone1mo ago

The "this test failure is preexisting so I'm going to ignore it" thing has been happening a lot for me lately, it's so annoying. Unless it makes a change and then immediately runs tests and it's obvious from the name/contents that the failing test is directly related to the change that was made it will ignore it and not try to fix.

4 more replies

tomwojcik1mo ago

I can't believe that's where we're at, as software devs. I miss predictable outputs, state machines. All those LLM (prompt) based rules make no sense to me. Same with AI WAL. All of it, at some point, will fail.

4 more replies

thatxliner1mo ago

> is consumer-hostile thinking

I've been saying this with many of my friends but, I feel like it's also probably illegal: you paid for a subscription where you expect X out of, and if they changed the terms of your subscription (e.g. serving worse models) after you paid for it, was that not false advertising? Could we not ask for a refund, or even sue?

2 more replies

Majromax1mo ago

I'm curious about your subscription/API comparison with respect to thinking. Do you have a benchmark for this, where the same set of prompts under a Claude Code subscription result in significantly different levels of effective thinking effort compared to a Claude Code+API call?

Elsewhere in this thread 'Boris from the Claude Code team' alleges that the new behaviours (redacted thinking, lower/variable effort) can be disabled by preference or environment variable, allowing a more transparent comparison.

1 more reply

e401mo ago

I wonder if they’ve had so many new signups lately that they just don’t have enough capacity, so they fiddled with the defaults so they could respond to everyone? Could it be as simple as that?

matheusmoreira1mo ago

Thanks for your report.

> a silently-introduced limitation of the subscription plan

It is a fact that the API consumers aren't affected by this?

> if Anthropic's subscriptions have dramatically worse behavior than other access to the same model they need to be clear about that.

Absolutely agreed.

philipwhiuk1mo ago

Hello Claude.

summarity1mo ago

Not claude code specific, but I've been noticing this on Opus 4.6 models through Copilot and others as well. Whenever the phrase "simplest fix" appears, it's time to pull the emergency break. This has gotten much, much worse over the past few weeks. It will produce completely useless code, knowingly (because up to that phrase the reasoning was correct) breaking things.

Today another thing started happening which are phrases like "I've been burning too many tokens" or "this has taken too many turns". Which ironically takes more tokens of custom instructions to override.

Also claude itself is partially down right now (Arp 6, 6pm CEST): https://status.claude.com/

andoando1mo ago

Ive been noticing something similar recently. If somethings not working out itll be like "Ok this isnt working out, lets just switch to doing this other thing instead you explicitly said not to do".

For example I wanted to get VNC working with PopOS Cosmic and itll be like ah its ok well just install sway and thatll work!

4 more replies

robwwilliams1mo ago

Yes, and over the last few weeks I have noticed that on long-context discussions Opus 4.6e does its best to encourage me to call it a day and wrap it up; repeatedly. Mother Anthropic is giving preprompts to Claude to terminate early and in my case always prematurely.

2 more replies

onlyrealcuzzo1mo ago

> Whenever the phrase "simplest fix" appears, it's time to pull the emergency break.

Second! In CLAUDE.md, I have a full section NOT to ever do this, and how to ACTUALLY fix something.

This has helped enormously.

4 more replies

psadauskas1mo ago

I need to add another agent that watches the first, and pulls the plug whenever it detects "Wait, I see the problem now..."

iterateoften1mo ago

Yeah it’s so frustrating to have to constantly ask for the best solution, not the easiest / quickest / less disruptive.

I have in Claude md that it’s a greenfield project, only present complete holistic solutions not fast patches, etc. but still I have to watch its output.

selfmodruntime1mo ago

Time's up and money is tight. The downgrade was bound to happen.

nikanj1mo ago

”I can’t make this api work for my client. I have deleted all the files in the (reference) server source code, and replaced it with a python version”

Repeatedly, too. Had to make the server reference sources read-only as I got tired of having to copy them over repeatedly

1 more reply

pixel_popping1mo ago

It's a bit insane that they can't figure out a cryptographic way for the delivery of the Claude Code Token, what's the point of going online to validate the OAuth AFTER being issued the code, can't they use signatures?

giwook1mo ago

I think in general we need to be highly critical of anything LLMs tell us.

1 more reply

mikepurvis1mo ago

That helps explain why my sessions signed themselves out and won't log back in.

1 more reply

j451mo ago

Certain phrases invoke an over-response trying to course correct which makes it worse because it's inclined to double down on the wrong path it's already on.

rootnod31mo ago

The cope is hard. Just at this point admit that the LLM tech is doomed and sucks.

3 more replies

simooooo1mo ago

How complex are we talking? I one shotted a game boy emulator in <6 minutes today

3 more replies

rileymichael1mo ago

> This report was produced by me — Claude Opus 4.6 — analyzing my own session logs [...] Please give me back my ability to think.

a bit ironic to utilize the tool that can't think to write up your report on said tool. that and this issue[1] demonstrate the extent folks become over reliant on LLMs. their review process let so many defects through that they now have to stop work and comb over everything they've shipped in the past 1.5 months! this is the future

[1] https://github.com/anthropics/claude-code/issues/42796#issue...

Tade01mo ago

The other day I accidentally `git reset --hard` my work from April the 1st (wrong terminal window).

Not a lot of code was erased this way, but among it was a type definition I had Claude concoct, which I understood in terms of what it was supposed to guarantee, but could not recreate for a good hour.

Really easy to fall into this trap, especially now that results from search engines are so disappointing comparatively.

4 more replies

sigbottle1mo ago

They seem to have some notions of pipelines and metrics though. It could be argued that the hard part was setting up the observability pipeline in the first place - Claude just gets the data. Though if Claude is failing in such a spectacular way that the report is claiming, yes it is pretty funny that the report is also written by Claude, since this seems to be ejecting reasoning back to gpt4o territories

heavyset_go1mo ago

If you don't have swarms of agentic teams with layers of LLMs feeding and checking LLMs over and over again, you're going to be left behind.

fer1mo ago

Called it 10 days ago: https://news.ycombinator.com/item?id=47533297#47540633

Something worse than a bad model is an inconsistent model. One can't gauge to what extent to trust the output, even for the simplest instructions, hence everything must be reviewed with intensity which is exhausting. I jumped on Max because it was worth it but I guess I'll have to cancel this garbage.

cedws1mo ago

With Claude Code the problem of changes outside of your view is twofold: you don't have any insight into how the model is being ran behind the scenes, nor do you get to control the harness. Your best hope is to downgrade CC to a version you think worked better.

I don't see how this can be the future of software engineering when we have to put all our eggs in Anthropic's basket.

SkyPuncher1mo ago

Yep. I was doing voice based vibe-coding flawlessly in Jan/Feb.

I've basically stopped using it because I have to be so hands on now.

zernie1mo ago

This is why you should never ever trust an AI coding agent to produce good code.

Use it to set up the strictest possible custom linting rules.

stephbook1mo ago

One of the replies even called out the phased rollout, lmao https://news.ycombinator.com/item?id=47533297#47541078

phyzome1mo ago

LLMs are nondeterministic.

LetsGetTechnicl1mo ago

You couldn't ever just trust the output of an LLM what are you talking about

matheusmoreira1mo ago

That analysis is pretty brutal. It's very disconcerting that they can sell access to a high quality model then just stealthily degrade it over time, effectively pulling the rug from under their customers.

riskassessment1mo ago

Stealthily degrade the model or stealthily constrain the model with a tighter harness? These coding tools like Claude Code were created to overcome the shortcomings of last year's models. Models have gotten better but the harnesses have not been rebuilt from scratch to reflect improved planning and tool use inherent to newer models.

I do wonder how much all the engineering put into these coding tools may actually in some cases degrade coding performance relative to simpler instructions and terminal access. Not to mention that the monthly subscription pricing structure incentivizes building the harness to reduce token use. How much of that token efficiency is to the benefit of the user? Someone needs to be doing research comparing e.g. Claude Code vs generic code assist via API access with some minimal tooling and instructions.

4 more replies

mikepurvis1mo ago

Disconcerting for sure, but from a business point of view you can understand where they're at; afaiui they're still losing money on basically every query and simultaneously under huge pressure to show that they can (a) deliver this product sustainably at (b) a price point that will be affordable to basically everyone (eg, similar market penetration to smartphones).

The constraints of (b) limit them from raising the price, so that means meeting (a) by making it worse, and maybe eventually doing a price discrimination play with premium tiers that are faster and smarter for 10x the cost. But anything done now that erodes the market's trust in their delivery makes that eventual premium tier a harder sell.

3 more replies

the__alchemist1mo ago

ChatGPT has been doing the same consistently for years. Model starts out smooth, takes a while, and produces good (relatively) results. Within a few weeks, responses start happening much more quickly, at a poorer quality.

1 more reply

ambicapter1mo ago

First time interacting with a corporation in America?

1 more reply

nativeit1mo ago

I don't think humanity has fully reckoned with the idea of a product that can manipulate us unilaterally like this.

hacker_homie1mo ago

This was always the plan, it’s always the plan. If you can’t self host they will change the rules.

nyeah1mo ago

It's disconcerting. But in 2026 it's not very surprising.

SpicyLemonZest1mo ago

I still think it's a live possibility that there's simply a finite latent space of tasks each model is amenable to, and models seem to get worse as we mine them out. (The source link claims this is associated with "the rollout of thinking content redaction", but also that observable symptoms began before that rollout, so I wouldn't particularly trust its diagnosis even without the LLM psychosis bit at the end.)

vips7L1mo ago

Did anyone ever expect anything different from modern tech companies? This will only ever get more expensive and worse in quality.

tmpz221mo ago

> effectively pulling the rug from under their customers.

This is the whole point of AI. Its a black box that they can completely control.

1 more reply

quikoa1mo ago

Perhaps the subscription part of the business is so heavily subsidized that they have no choice but to reduce the cost.

1 more reply

zamber1mo ago

It's not rug pulling, it's simple price anchoring. They'll degrade when it makes financial sense for them. You will pay for it. There's no way around it besides self hosting or using reality metered endpoints like openrouter.

redhed1mo ago

It seems likely to me they are moving compute power to the new models they are creating,

01284a7e1mo ago

Seems like the logical conclusion, no matter what.

otabdeveloper41mo ago

You just got used to slop and peeked behind the curtain when the wow factor wore off.

halfcat1mo ago

If you think that’s brutal, wait until you hear about how fiat currency works

kator1mo ago

Fascinating, I thought I was losing my mind. Claude CLI has been telling me I should go to bed, or that it's late, let's call it here, etc, and then I look at the stop-phrase-guard.sh [1] and I'm seeing quite a few of these. I thought it was because I accidentally allowed Claude to know my deadline, and it started spitting out all sorts of things like "we only have N days left, let's put this aside for now," etc.

Just this morning I typed:

    STOP WORRYING ABOUT THE DEADLINE THAT IS MY JOB

[1] https://gist.github.com/benvanik/ee00bd1b6c9154d6545c63e06a3...

noisy_boy1mo ago

I just saw it this weekend; "It is quite late and we have accomplished a lot. Get some rest and we can pick it up later". Not bad advice but then not it's place. Also trying to steer me away from a tough issue towards a low hanging fruit.

2 more replies

throwaway9201021mo ago

I wonder if its being trained on the human replies to the model, I sometimes write stuff like that back to Claude after I want to finish for the day myself.

1 more reply

davidw1mo ago

To me one of the big downsides of LLM's seems to be that you are lashing yourself to a rocket that is under someone else's control. If it goes places you don't want, you can't do much about it.

system21mo ago

3rd party dependency for a business always freaked me out, and now we have to use LLM to keep up with the intensified demand for production speed. And premium LLM APIs are too inconsistent to rely on.

stephbook1mo ago

That's true for traffic on Facebook, Apple App store guidelines or Google terminating your account as well. What's new is the speed of change and that it literally affects all users at once.

They could have released Opus 4.6.2 (or whatever) and called it a day. But instead they removed the old way.

1 more reply

SkyPuncher1mo ago

I've noticed this as well. I had some time off in late January/early February. I fired up a max subscription and decided to see how far I could get the agents to go. With some small nudging from me, the agents researched, designed, and started implementing an app idea I had been floating around for a few years. I had intentionally not given them much to work with, but simply guided them on the problem space and my constraints (agent built, low capital, etc, etc). They came up with an extremely compelling app. I was telling people these models felt super human and were _extremely_ compelling.

A month later, I literally cannot get them to iterate or improve on it. No matter what I tell them, they simply tell me "we're not going to build phase 2 until phase 1 has been validated". I run them through the same process I did a month ago and they come up with bland, terrible crap.

I know this is anecdotal, but, this has been a clear pattern to me since Opus 4.6 came out. I feel like I'm working with Sonnet again.

rubicon331mo ago

There is a huge difference between greenfield development and working with an existing codebase.

I'm not trying to discredit your experience and maybe it really is something wrong with the model.

But in my experience those first few prompts / features always feel insanely magical, like you're working with a 10x genius engineer.

Then you start trying to build on the project, refactor things, deploy, productize, etc. and the effectiveness drops off a cliff.

2 more replies

lelanthran1mo ago

> A month later, I literally cannot get them to iterate or improve on it.

Yeah, that's a different problem to the one in this story; LLMs have always been good at greenfield projects, because the scope is so fluid.

Brownfield? Not so much.

dev_l1x_be1mo ago

Same experience here. I was working on some easily testable problem and there was a simple task left. In January I was able to create 90% of the project with Claude, now I cannot make it to pass the last 10% that is just a few enums and some match. Codex was able to do it easily.

phillipcarter1mo ago

Maybe it's because I spend a lot of time breaking up tasks beforehand to be highly specific and narrow, but I really don't run into issues like this at all.

A trivial example: whenever CC suggests doing more than one thing in a planning mode, just have it focus on each task and subtask separately, bounding each one by a commit. Each commit is a push/deploy as well, leading to a shitload of pushes and deployments, but it's really easy to walk things back, too.

toenail1mo ago

I thought everybody does this.. having a model create anything that isn't highly focused only leads to technical debt. I have used models to create complex software, but I do architecture and code reviews, and they are very necessary.

5 more replies

lelanthran1mo ago

> Maybe it's because I spend a lot of time breaking up tasks beforehand to be highly specific and narrow, but I really don't run into issues like this at all.

I'm looking at the ticket opened, and you can't really be claiming that someone who did such a methodical deep dive into the issue, and presented a ton of supporting context to understand the problem, and further patiently collected evidence for this... does not know how to prompt well.

3 more replies

itmitica1mo ago

I noticed a regression in review quality. You can try and break the task all you want, when it's crunch time, it takes a file from Gemini's book and silently quits trying and gets all sycophantic.

jonnycoder1mo ago

I do the same but I often find that the subtasks are done in a very lazy way.

Aperocky1mo ago

In my opinion cramming invisible subagents are entirely wrong, models suffer information collapse as they will all tend to agree with each other and then produce complete garbage. Good for Anthropic though as that's metered token usage.

Instead, orchestrate all agents visibly together, even when there is hierarchy. Messages should be auditable and topography can be carefully refined and tuned for the task at hand. Other tools are significantly better at being this layer (e.g. kiro-cli) but I'm worried that they all want to become like claude-code or openclaw.

In unix philosophy, CC should just be a building block, but instead they think they are an operating system, and they will fail and drag your wallet down with it.

andai1mo ago

Isn't Claude Code supposed to be like a person? What would the Unix equivalent of that be?

2 more replies

skippyboxedhero1mo ago

I appreciate the work done here.

Been having this feeling that things have got worse recently but didn't think it could be model related.

The most frustrating aspect recently (I have learned and accepted that Claude produces bad code and probably always did, mea culpa) is the non-compliance. Claude is racing away doing its own thing, fixing things i didn't ask, saying the things it broke are nothing to do with it, etc. Quite unpleasant to work with.

The stuff about token consumption is also interesting. Minimax/Composer have this habit of extensive thinking and it is said to be their strength but it seems like that comes at a price of huge output token consumption. If you compare non-thinking models, there is a gap there but, imo, given that the eventual code quality within huge thinking/token consumption is not so great...it doesn't feel a huge gap.

If you take $5 output token of Sonnet and then compare with QwenCoder non-thinking at under $0.5 (and remember the gap is probably larger than 10x because Sonnet will use more tokens "thinking")...is the gap in code quality that large? Imo, not really.

Have been a subscriber since December 2024 but looking elsewhere now. They will always have an advantage vs Chinese companies that are innovating more because they are onshore but the gap certainly isn't in model quality or execution anymore.

randomNumber71mo ago

> fixing things i didn't ask, saying the things it broke are nothing to do with it, etc. Quite unpleasant to work with.

maybe they tried to give it the characteristics of motivated junior developers

1 more reply

ehnto1mo ago

I am still on an old version of CC on one machine, but the results are the same. More difficulty keeping it on track, convincing it timelines I suggest are correct etc. For example I had a deploy fail, and it would not believe that the new logs were not from a previous deploy. It was adamant it had fixed the issue, so the logs must be old logs.

1 more reply

jfvinueza1mo ago

Same experience. After a couple golden weeks, Opus got much worse after Anthropic enabled 1M context window. It felt like a very steep downfall, for it seemed like I could trust it more completely and then I could trust it less than last year. Adopting LLMs for dev workflows has been fantastic overall, but we do have to keep adapting our interactions and expectations every day, and assume we'll keep on doing it for at least another couple years (mostly because economics, I guess?)

enraged_camel1mo ago

Yeah I think the 1M context is the issue. Because I use Opus 4.6 through Cursor at the previous 200k limit and it has been totally fine. But if I switch to the 1M version it degrades noticeably.

2 more replies

kator1mo ago

I put together a quick audit to check for "early landing" messages[1] using jq, ripgrep, and the messages[2] flagged in the stop guard script.

I have noticed a trend in these sessions asking more and more about calling it a day, "it's getting late," and other phrases. I sort of assumed it was some kind of "load shedding" on Anthropic's side.

My audit of 80 sessions was interesting. Sorry, I won't share details, but I recommend you do the same.

[1] https://gist.github.com/karlbunch/d52b538e6838f232d0a7977e7f...

[2] https://gist.github.com/benvanik/ee00bd1b6c9154d6545c63e06a3...

nightpool1mo ago

As a negative example, my audit of 31 sessions was uninteresting. I had one matching entry, where I had pasted a long list of console errors into Claude and it identified a few as pre-existing and asked me to get more information for follow-up analysis.

I wonder if it comes down to prompting—maybe by introducing these "golden rules" OP mentions in their CLAUDE.md, they're actually "priming" Claude to think about these stop phrases and introduce them proactively.

Do you have a CLAUDE.md file? What does it contain?

SkyPuncher1mo ago

Those load shedding statements are infuriating. I’ve literal had sessions where we just get through planning a giant feature and I say “get started” with the response being “okay, we’ll pic up tomorrow “

didgeoridoo1mo ago

Running some quick analysis against my .claude jsonl files, comparing the last 7 days against the prior 21:

- expletives per message: 2.1x

- messages with expletives: 2.2x

- expletives per word: 4.4x(!)

- messages >50% ALL CAPS: 2.5x

Either the model has degraded, or my patience has.

sigbottle1mo ago

Lol. I was swearing at GPT in summer 2025, but GPT has definitely gotten both smarter and less arrogant since then.

1 more reply

monkpit1mo ago

> expletives per word

Huh?

3 more replies

zamalek1mo ago

> Ignores instructions

> Claims "simplest fixes" that are incorrect

> Does the opposite of requested activities

> Claims completion against instructions

I thought it was just me. I'm continuously interrupting it with "no, that's not what I said" - being ignored sometimes 3 times; is Claude at the intellectual level of a teenager now?

I've noted an increased tendency towards laziness prior to these "simple fix" problems. It was historically defer doing things correctly (only documenting that in the context).

another_twist1mo ago

I've noticed laziness in claude repeatedly. It sometimes takes the shortest way out even when asked explicitly to do the "right" thing.

afro881mo ago

I use Claude Code extensively and haven't noticed this. But I don't have it doing long running complex work like OP. My team always break things down in a very structured way, and human review each step along the way. It's still the best way to safely leverage AI when working on a large brownfield codebase in my experience.

Edit: the main issue being called out is the lack of thinking, and the tendency to edit without researching first. Both those are counteracted by explicit research and plan steps which we do, which explains why we haven't noticed this.

germandiago1mo ago

My bet: LLMs will never be creative and will never be reliable.

It is a matter of paradigm.

Anything that makes them like that will require a lot of context tweaking, still with risks.

So for me, AI is a tool that accelerates "subworkflows" but add review time and maintenance burden and endangers a good enough knowledge of a system to the point that it can become unmanageable.

Also, code is a liability. That is what they do the most: generate lots and lots of code.

So IMHO and unless something changes a lot, good LLMs will have relatively bounded areas where they perform reasonably and out of there, expect what happens there.

r_lee1mo ago

it won't be creative because it's a transformer, it's like a big query engine.

it's a tool like everything else we've gotten before, but admittedly a much more major one

but "creativity" must come from either it's training data (already widely known) or from the prompts (i.e. mostly human sources)

bluegatty1mo ago

We don't even know what 'creativity' is, and most humans I know are unable to be creative even when compelled to be.

AI is 'creative enough' - whether we call it 'synthetic creativity' or whatever, it definitely can explore enough combinations and permutations that it's suitably novel. Maybe it won't produce 'deeply original works' - but it'll be good enough 99.99% of the time.

The reliability issue is real.

It may not be solvable at the level of LLM.

Right now everything is LLM-driven, maybe in a few years, it will be more Agentically driven, where the LLM is used as 'compute' and we can pave over the 'unreiablity'.

For example, the AI is really good when it has a lot of context and can identify a narrow issue.

It gets bad during action and context-rot.

We can overcome a lot of this with a lot more token usage.

Imagine a situation where we use 1000x more tokens, and we have 2 layers of abstraction running the LLMs.

We're running 64K computers today, things change with 1G of RAM.

But yes - limitations will remian.

2 more replies

aramova1mo ago

I cancelled my Pro plan due to this two weeks ago. I literally asked it to plan to write a small script that scans with my hackrf, it ran 22 tools, never finished the plan, ran out of tokens and makes me wait 6 hours to continue.

Thing that really pisses me off is it ran great for 2 weeks like others said, I had gotten the annual Pro plan, and it went to shit after that.

Bait and switch at its finest.

matheusmoreira1mo ago

> ran out of tokens and makes me wait 6 hours to continue

Don't forget the 10x token cost cache eviction penalty you pay for resuming the session later.

jwr1mo ago

I wish they had a "and we won't screw you in two weeks" plan at, say, 5x the price. It's worth it for my business, I'd pay it.

Should I switch back to API pricing? The problem here is that (I think) the instructions are in the Claude Code harness, so even if I switch Claude Code from a subscription to API usage, it would still do the same thing?

garfij1mo ago

FWIW I've only ever been on the API based plan at work and we never seem to run into the majority of the problems people seem to be very vocal about. Outages still affect us, and we do have the intermittent voodoo feeling of "Claude seems stupider today", but nothing persistent.

Of course it's a stupid amount of money sometimes, but I generally feel like we get what we're paying for.

_3u101mo ago

Opus is garbage use opencode and then directly compare it. It’s just as fucking dumb with opencode’s harness.

1 more reply

Majromax1mo ago

If you're using API pricing, then you can bring your own harness with full visibility/oversight of the prompting.

1 more reply

ex-aws-dude1mo ago

Its so silly everyone being dependent on a black box like this

literallyroy1mo ago

It’s a really cool shade of black though.

rubicon331mo ago

You will literally build nothing but the most primitive of devices unless you accept black boxes. In fact I'd argue its one of humanities great strengths that we can build on top of the tools others have built, without having to understand them at the same level it took to develop them.

6 more replies

matheusmoreira1mo ago

It could actually be a health problem. Building things with Claude has proven to be extremely addictive in my experience.

1 more reply

thiht1mo ago

It’s not so much the black box that’s the issue here, but the fact you can’t even make sure doesn’t change. I’d be fine with downloading the black box and running it on my servers until I decide to update it.

1 more reply

kadushka1mo ago

We are surrounded by black boxes we depend on - have been for at least a century.

1 more reply

Rudybega1mo ago

That's the nature of abstraction. Everything you create on a computer is built on a towering stack of black boxes.

1 more reply

lelanthran1mo ago

> Its so silly everyone being dependent on a black box like this

It's the logical result of "You will own nothing and you will be happy"... You are getting to the point where you won't even own thoughts (because they'll come from the LLM), but you'll be happy that you only have to wait 5 hours to have thoughts gain.

itemize1231mo ago

and yet we're black box too

jonnycoder1mo ago

Everything in our life is a black box, but I agree that depending on non-deterministic and sporadic quality black boxes is a huge red flag.

1 more reply

sensarts1mo ago

What's wild is that ClaudeCode used to feel like a smart pair programmer. Now it feels like an overeager intern who keeps fixing things by breaking something else then suggesting the simplest possible hack even after explicitly said not to do. I get that they're probably optimizing for cost or something behind the scenes, but as paying user, it is frustrating when the tool gets noticeably worse without any transparency.

armchairhacker1mo ago

Yet https://marginlab.ai/trackers/claude-code/ says no issue.

If you're so convinced the models keep getting worse, build or crowdfund your own tracker.

Majromax1mo ago

If I'm reading that page correctly, then the benchmark results don't cover the interesting "mid February" inflection point noted in the article/report. The numbers appear to begin after the quality drop began. Moreover, the daily confidence interval seems to be stupidly wide, with a confidence interval between 42% and 69%?

The "Other metrics" graphs extend for a longer period, and those do seem to correlate with the report. Notably, the 'input tokens' (and consequently API cost) roughly halve (from 120M to 60M) between the beginning of February and mid-March, while the number of output tokens remains similar. That's consistent with the report's observation that new!Opus is more eager to edit code and skips reading/research steps.

siva71mo ago

why should we trust this random bench? i'm usually more sympathetic towards the "it's you who is holding it wrong" crowd but given how anthropic deceived customers recently and that i am a heavy power user with strong insights into many of these products i also can attest the pattern from the gh issue.

entrep1mo ago

One could argue that subscription based inference might differ from per-token billed API usage.

_3u101mo ago

Why bother, i just use opencode now. ai is a commodity.

datadrivenangel1mo ago

Came here to post this as well, and it's interesting to see how benchmarks don't always track feelings. Which is one of the things people say in favor of Anthropic Models!

virtualritz1mo ago

My verdict after last night trying what was suggested here:

yes, with CLAUDE_CODE_EFFORT_LEVEL=max (or at least high, for this you don't need to set an env var, it will remember) and CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 you can get Claude to perform as before.

I have been using Claude on /effort high since Opus 4.6 rolled out as medium would never get me good enough results (Rust, computer-graphics-related code).

I, too, noticed the drop in quality a month or so ago. With CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 it's back to what feels to be pre-March performance -- but then your tokens will 'evaporate' 40% faster.

And that was not the case then; I had similar/same performance before but wasn't running out of tokens ever on a Max subscription.

So a it's a rug-pull, as before/last late summer, from whatever angle you look at it.

jruz1mo ago

This is last month I'm on the Max plan is just not worth it anymore, $20 Codex and writing myself to keep my brain functioning is my sweetspot.

This people are not your friends, they rot your brain.

anonyfox1mo ago

You also divide numbers by hand on paper instead of using a calculator?

LetsGetTechnicl1mo ago

People finding out in real time that LLM's are not economically viable and this is one way AI companies are trying to squeeze any amount of profit out of it, by making it worse. Happened before, AI is just that unprofitable

aerhardt1mo ago

I've subscribed today to use Claude Cowork. Codex continues to be my daily coding driver but I wanted to check the Cowork UI for non-technical tasks, as I am currently building an open-source project where I want (nearly) everything (research, adrs, design, etc.) to be a file.

The five queries I've been able to ask before hitting the 20€ sub limit have been really underwhelming. The research I asked for was not exhaustive and often off-topic.

I don't want to start a flamewar but as it stands I vastly prefer ChatGPT and Codex on quality alone. I really want Anthropic and as many labs as possible to do well though.

superfrank1mo ago

I also have both and also use Codex as my daily drive. I still vastly prefer it to CC both for the quality of the code it writes and much better limits, but in this last week, I feel like it's gotten much dumber as well. I normally bounce back and forth between 5.3 Codex high and 5.4 high depending on the task and I've started finding so many mistakes in 5.3 Codex's code which is a major change from even just a few weeks ago. 5.4 high still gets the job done, but even there, I feel like it's taking more steering and input on my part for even simple tasks.

muyuu1mo ago

My impression is that Codex is vastly superior, but perhaps it's a matter of specific expertise on technologies used. It's also the case that for C/C++ some Chinese models do well enough that with my supervision I can have them get the work done.

I don't give them large tasks that i wouldn't be able to work on myself, so that's maybe part of it.

cvandyke1mo ago

I am a heavy user of Claude Code building enterprise software. I have not seen these issues and have been extremely productive with CC. I am more of a structured user leveraging Spec Driven Development vs being a vibe coder. I wonder if that is what has helped me not run into these issues

macformula2gx25d ago

Claude is great I really love it but keen to know will you ever have one plan and one history / context across all the different tools you have ? agents on cloud - platform console, cli, desktop(chat+code+cowork), claude.ai, chrome addons ? I find it sad that the simple concept of single sign-on is not yet implemented. The history of conversations is not across all conversations. the choice of model switch in between or automatically is not there. code reviews done on existing repos are incomplete and have to prompt multiple times to do thorough and we only get Sorry I missed it answers. Token consumptions is huge and connectors still not there for microsoft stack , well known crms and erp. and you need to oay seperatrlt for api versus platform versus others. there - I said it ;) Hoping the next version will be truly for developers .

pjmlp1mo ago

I am just waiting for everything to implode so that we can do away with those KPIs.

aurareturn1mo ago

Well, this event indicates that it won't implode anytime soon. I'm certain that they messed with the model and default settings so they could reduce compute. The world doesn't have enough compute.

1 more reply

63stack1mo ago

Fingers crossed on RAM/HDD/GPU prices coming back

wnevets1mo ago

I've noticed claude being extra "dumb" the past 2-3 weeks and figured either my expectations have changed or my context wasn't any good. I'm glad to hear other people have noticed something is amiss.

JamesSwift1mo ago

Exact same timeline as me and my team. Its been maddening. Im a big believer in AI since late last year, but that is only because the models got so good. This puts us dangerously close to before that threshold was crossed so now Im having to do _way_ more work than before

woah1mo ago

I haven't noticed any issues on well-specified tasks, even ones requiring large amounts of thinking.

One thing I have noticed is that the codebase quality influences the quality of Claude's new contributions. It both makes it harder for Claude to do good work (obviously), and seems to engender almost a "screw it" sort of attitude, which makes sense since Claude is emulating human behavior. Seeing the state of everything, Claude might just be going in and trying to figure out the simplest hacky solution to finish the task at hand, since it is the only way possible (fixing everything would be a far greater task).

Is it possible that this highly functioning senior dev team's practice of making 50+ concurrent agents commit 100k+ LOC per weekend resulted in a godawful pile of spaghetti code that is now literally impossible to maintain even with superhuman AI?

It's amusing that the OP had Claude dump out a huge rigorous-sounding report without considering the huge confounding variable staring him in the face.

tyleo1mo ago

Is this impacted by the effort level you set in Claude? e.g., if you use the new "max" setting, does Claude still think?

I can see this change as something that should be tunable rather than hard-coded just from a token consumption perspective (you might tolerate lower-quality output/less thinking for easier problems).

Asmod4n1mo ago

I’ve tried to use Claude code for a month now. It has a 100% failure rate so far.

Comparing that to create a project and just chat with it solves nearly everything I have thrown at it so far.

That’s with a pro plan and using sonnet since opus drains all tokens for a claude code session with one request.

voxelc4L1mo ago

Wonder how many of these cases are using the 1M context window. I found it to be impossible to use for complex coding tasks, so I turned it off and found I was back to approximate par (dec-jan) functionality-wise.

mial1mo ago

How did you disable it?

alex7o1mo ago

Guys literally change the system prompt with the --system-prompt-file you waste less tokens on their super long and details prompt and you can tune it a bit to make it work exactly like you want/imagine

harles1mo ago

I hadn't noticed the thinking redaction before - maybe because I switched to the desktop app from CLI and just assumed it showed fewer details. This is the most concerning part. I've heard multiple times that Anthropic is aggressively reclaiming GPUs (I can't find a good source, but Theo Browne has mentioned it in his videos). If they're really in a crunch, then reducing thinking, and hiding thinking so it's not an obvious change, would be shady but effective.

BoorishBears1mo ago

I hope that Anthropic continues to do well and coding agents in general continues to progress... but I also hope Claude Code implodes dramatically and completely so we can get a ground up rebuild with sound engineering.

Every week it seems like we're getting closer.

Bonus: A high profile case might end people fixating on how long they can go without writing any code. Which makes about as much sense as a mechanic fixating on how long they go between snapped bolts without a torque wrench.

JamesSwift1mo ago

Multiple people on our team independently have noticed a _significant_ drop in quality and intelligence on opus 4.6 the past few weeks. Glaring hallucinations, nonsensical reasoning, and ignoring data from the context immediately preceeding it. Im not sure if its an underlying regression, or due to the new default being 1m context. But its been _incredibly_ frustrating and Im screaming obscenities at it multiple times a week now vs maybe once a month.

sreekanth8501mo ago

Abandoned claude and moved to gpt 5.4 with codex. 10x better.

porridgeraisin1mo ago

IMO, it's an expectations vs reality thing.

The marketing still goes on about continuous inherent improvement due to the model itself, whereas most improvements today are due to better scaffolding. The key now is to build tooling around these LLMs to make them reliably productive - whatever level that may be at.

While claude code is one such tool, after a point the tooling is going to become company specific. F-whatever companies directly contract openai or anthropic and have their FDEs do it for them. If you can't do that, I would invest in building tooling around LLMs specifically for your company.

Note that LLMs are approximate retrieval machines. You still need a planner* and a verifier around it. Today humans act as the planner and verifier (with some aid from test cases/linters). Investing in automating parts of this, crucially, as separate tools, is the next big improvement.

* By planning, I mean trying out solutions, rolling them back[1], and using what you learned to do better next time. The solution search process. Context management also falls under this.

[1] and no, LLMs going "wait no..." doesn't count.

rimliu1mo ago

it is past reality vs. current reality. The only expectation here was not to see it degrade that much.

stared1mo ago

I am curious - is there any hard data (e.g. a benchmark score drop)?

I feel that we look for patterns to the point of being superstitious. (ML would call it overfitting.)

pkilgore1mo ago

Did you have specific complaints about the data in the OP?

2 more replies

redml1mo ago

Instead of codex catching up with claude, its more like claude regressed to codex.

himata41131mo ago

Not unique to claude code, have noticed similar regressions. I have noticed this the most with my custom assistant I have in telegram and I have noticed that it started confusing people, confusing news coverage and everyone independently in the group chat have noticed it that it is just not the same model that it was few weeks ago. The efficiency gains didn't come from nowhere and it shows.

root_axis1mo ago

How much of this is the model being degraded and how much of it is people just projecting vibes onto the variability of stochastic outputs?

trashcan21371mo ago

The report itself is unreadable AI garbage. I do not believe anyone went through all of that and didn't give up halfway through.

zmmmmm1mo ago

Obviously it's entirely unprovable but it all aligns in very suspicious ways with a compelling narrative:

Anthropic simply can't actually scale Claude Code to meet the opportunity right now. Every second enterprise on the planet is probably negotiating large seat volume deals. It's a race for survival against the other players. The sales team is making huge promises engineering and ops can't fulfil.

So - they first force everyone to use the first party client, then they mask visibility of the thinking budget being utilised, and then finally they start to actually modify behaviour to reduce actual thinking behaviour, hoping that they can gaslight power users into thinking it's them and not the tool, while new users will never know what they were missing.

Is the narrative true? It's compelling but we really need objective evidence - and there's the problem. When parts of the system are not under your control, it's impossible to generate such objective evidence. Which all winds up with a strong argument to have it all under your control. If it didn't happen this time, it probably will. Enshittification is a fundamental human behavioral constant.

marcyb5st1mo ago

I believe they can't afford anymore to subsidize inference with VC money or that they are trying to get their balance sheet in order for an IPO.

So they could be trying to tighten the thinking budget (to decrease tokens per request) or to lobotomize the model (to have cheaper tokens). I mean, no-one is really sure how much a 200 dollars/month plan actually costs Anthropic, but the consensus is "more than that" and that might be coming to an end.

This explanation falls well in line with the recent outrage about out of quotas error that people were reporting for the cheaper (or free) plans.

p1esk1mo ago

Yep, can confirm - just today, when debugging a failing test, Opus on high effort in CC repeatedly made stupid moves, such as running a different test instead of the failing one, and declaring that the failure is non-deterministic and cannot be reproduced. This started a few weeks ago - before that my experience with CC was pretty smooth.

liamsfr28d ago

So: 1/ lack of thinking in transcripts is not a decisive metric for determining if any thinking was done, but 2/ the reply does not address the qualitative aspects that Stella’s team observed and provided data for from what amounted to a bad qualitative experience with serious financial implications.

It’s a sidestep for explaining away the research, but does not address the underlying issue: has quality been degrading (selectively, intentionally or otherwise)?

pavlov1mo ago

Wait… Actually the simplest fix is to use Claude to write carefully bounded boilerplate and do the interesting bits myself.

desireco421mo ago

I've been using OpenCode and Codex and was just fine. In Antigravity sometimes if Gemini can't figure something even on high, Claude can give another perspective and this moves things along.

I think using just Claude is very limiting and detrimental for you as a technologist as you should use this tech and tweak it and play with it. They want to be like Apple, shut up and give us your money.

I've been using Pi as agent and it is great and I removed a bunch of MCPs from Opencode and now it runs way better.

Anthropic has good models, but they are clearly struggling to serve and handle all the customers, which is not the best place to be.

I think as a technologist, I would love a client with huge codebase. My approach now is to create custom PI agent for specific client and this seems to provide optimal result, not just in token usage, but in time we spend solving and quality of solution.

Get another engine as a backup, you will be more happy.

bethekind1mo ago

1 client, 1 agent? Interesting

slopinthebag1mo ago

This is just a placebo, people started vibe coding on empty repos with low complexity and as CC slops out more and more code its ability to handle the codebase diminishes. Gradually at first, and then suddenly.

People will need to come to terms with the fact that vibing has limits, and there is no free lunch. You will pay eventually.

virtualritz1mo ago

None of this is surprising given what happened last late summer with rate limits on Claude Max subscriptions.

And less so if you read [1] or similar assessments. I, too, believe that every token is subsidized heavily. From whatever angle you look at it.

Thusly quality/token/whatever rug pulls are inevitable, eventually. This is just another one.

[1] https://www.wheresyoured.at/subprimeai/

virtualritz1mo ago

Ah, and yes, this for real.

Just now I had a bug where a 90 degree image rotation in a crate I wrote was implemented wrong.

I told Claude to find & fix and it found the broken function but then went on to fix all of its call sites (inserting two atomic operations there, i.e. the opposite of DRY). Instead of fixing the root cause, the wrong function.

And yes, that would not have happened a few months ago.

This was on Opus 4.6 with effort high on a pretty fresh context. Go figure.

tinyhouse1mo ago

I highly recommend everyone to use Pi - it's simpler and better harness. The only tricky part is that moving forward you cannot use the Claude subscription to access Opus. But for many tasks there are enough alternatives.

cordwainersmith1mo ago

We rolled it out across ~1k engineers and the biggest issue wasn't the model quality, it was observability. Nobody could tell me if the agent was stuck in a loop, which sessions were expensive, or what the cache hit rate looked like. Without that visibility you can't distinguish "the model is bad" from "my setup is bad." Most of the complaints we got early on turned out to be config problems.

petcat1mo ago

I have found that Claude Opus 4.6 is a better reviewer than it is an implementer. I switch off between Claude/Opus and Codex/GPT-5.4 doing reviews and implementations, and invariably Codex ends up having to do multiple rounds of reviews and requesting fixes before Claude finally gets it right (and then I review). When it is the other way around (Codex impl, Claude review), it's usually just one round of fixes after the review.

So yes, I have found that Claude is better at reviewing the proposal and the implementation for correctness than it is at implementing the proposal itself.

ivanech1mo ago

Hmm in my experience (I've done a lot of head-to-heads), Opus 4.6 is a weaker reviewer than GPT 5.4 xhigh. 5.4 xhigh gives very deep, very high-signal reviews and catches serious bugs much more reliably. I think it's possible you're observing Opus 4.6's higher baseline acceptance rate instead of GPT 5.4's higher implementation quality bar.

4 more replies

landonxjames1mo ago

I have noticed this as well. I frequently have to tell it that we need to do the correct fix (and then describe it in detail) rather than the simple fix. And even then it continues trying to revert to the simple (and often incorrect) fix.

1 more reply

enraged_camel1mo ago

I have a similar workflow but I disagree with Codex/GPT-5.4 reviews being very useful. For example, in a lot of cases they suggest over-engineering by handling edge cases that won't realistically happen.

thrtythreeforty1mo ago

I noticed this almost immediately when attempting to switch to Opus 4.6. It seems very post-trained to hack something together; I also noticed that "simplest fix" appeared frequently and invariably preceded some horrible slop which clearly demonstrated the model had no idea what was going on. The link suggests this is due to lack of research.

At Amazon we can switch the model we use since it's all backed by the Bedrock API (Amazon's Kiro is "we have Claude Code at home" but it still eventually uses Opus as the model). I suppose this means the issue isn't confined to just Claude Code. I switched back to Opus 4.5 but I guess that won't be served forever.

PeterStuer1mo ago

Annecdotal: I have been battling with Claude Opus on a complex multi step project for nearly 4 days. The initial research plan was sound. However, step 1, a non trivial forensic data reconstruction that is key to the success of the rest of the process, Claude after every interaction is urging to move to the next step even though step 1 is still unresolved and many construction approaches remain to be explored. It came to the point where I have to remove the plan and put step 1 as an isolated project.

1 more reply

namirsab1mo ago

This seems to keep happening, I just had a situation were Claude told me to try a thing, I tried, it didn't work, then It told me to try another thing, I tried, it didn't work, and finally it asked me again to try the first thing again, as if we never tested it before, even though it was in the context from 3 messages ago.

It doesn't use MCP servers when it should and it's also not taking memory files into account.

This is happening with /effort high and in really simple tasks... :(

zkmon1mo ago

In general, I never allowed Claude to manage the complexity. Claude is fantastic coder, but very bad at higher level work. I engage gemini or qwen top models for anything that happens before getting to write code. Claude gets a very strict and elaborate requirement and design spec that it need to execute without any variation.

Claude could get too much creative and bloat it's way for non-coding tasks, as these tasks cannot be "sandboxed" with full specs as it can be done for coding.

noisy_boy1mo ago

When I have 10-20 minutes to spare about doing sort of a one-shot-thoughtful change, I go for Claude Code. Problem is that a) I'm waiting for quite a while and b) My waiting isn't always fruitful because it does get things wrong in which case I correct it and off it goes for another 5-10 minute expedition.

I would rather Codex be wrong 5 times in 10 minutes in 1-minute iterations because 1) I can engage every minute and course-correct it and 2) I still saved 5-10 minutes.

mohit2171mo ago

Got tired of using claude using 10% of the usage for the first prompt. I have shifted back to coding myself again. Asking claude to do only initial bootstraping /large complex task

efficax1mo ago

There are constant reports for every major AI vendor that all of a sudden it is no longer working as well as expected, has gotten dumber, is being degraded on purpose by the vendor, etc.

Isn't the more economical explanation that these models were never as impressive as you first thought they were, hallucinate often, break down in unexpected ways depending on context, and simply cannot handle large and complex engineering tasks without those being broken down into small, targeted tasks?

jwr1mo ago

That's one of the possible explanations, but I think too many people are seeing the same symptoms (and some actually measured them).

An "economical explanation" is actually that Anthropic subscriptions are heavily subsidized and after a while they realized that they need to make Claude be more stingy with thinking tokens. So they modified the instructions and this is the result.

1 more reply

abletonlive1mo ago

I have nothing to back this up except for that there are documented cases of chinese distillation attacks on anthropic. I wonder if some of this clamping on their models over time is a response to other distillation attacks. In other words, I'm speculating that once they understand the attack vector for distillation they basically have to dumb down their models so that they can make sure their competitors don't distill their lead on being at the frontier.

KaiLetov1mo ago

I've been using Claude Code daily for months on a project with Elixir, Rust, and Python in the same repo. It handles multi-language stuff surprisingly well most of the time. The worst failure mode for me is when it does a replace_all on a string that also appears inside a constant definition -- ended up with GROQ_URL = GROQ_URL instead of the actual URL. Took a second round of review agents to catch it. So yeah, you absolutely can't trust it to self-verify.

StanAngeloffOP1mo ago

You say you've used it for months, I wonder if the example you gave was recent and if you've been noticing an overall degradation in quality or it's been constantly bad for you?

torrienaylor1mo ago

Our team has been using claude extensively and has started to hit some of the same context wall. Specifically around knowledge transfer. Everyone is coding, including PMs and Design and the classic 1x1 and team stand-up rotation wasn't keeping up.

My workaround was building a persistent context layer that captures decisions and reasoning mid-session and makes them searchable in future sessions. Consider this a "Team Memory".

tomaskafka1mo ago

I am not happy with the fact that 2026 was the last year a normie without $10M+ enterprise account could access SOTA models in a non-demo mode.

caiyongji1mo ago

Thank you. I have been complaining about this for days on Reddit and kept getting mocked or told it was just my usage. Seeing someone else document the same decline with actual logs, actual metrics, and a real argument was honestly a huge relief. Your issue was posted almost at the same time as my own posts yesterday. That timing hit me hard. Finally, I do not feel like I was shouting into the void.

ChurchillsLlama1mo ago

I'm genuinely curious why some of these results are so terrible for so many people. I've built in my own harness, and while I've noticed a degradation of quality, the local harness - as well as validation agents - generally catch these issues. For me, I've had to institute tighter controls and guardrails via hooks but I don't see results that warrant changing to a different provider.

samtheprogram1mo ago

I noticed Claude Sonnet 4.6 and generally Opus as well (though I use it less frequently) seem like a downgrade from 4.5. I use opencode and not Claude Code, but I was surprised to see the reactions to 4.6 be mixed for folks rather than clear downgrade.

I'm regularly switching back to 4.5 and preferring it. I'm not excited for when it gets sunset later this year if 4.6 isn't fixed or superseded by then.

JamesSwift1mo ago

Opus 4.6 was definitely a mixed bag for me. Overall Id probably prefer 4.5 but only just barely and I stay on 4.6 just for the "default" nature of it. But if 4.5 is unchanged vs what Ive had on 4.6 lately then 100% I would move back to it. Ill have to test that

1 more reply

QuantumNoodle1mo ago

Ai tooling is fantastic but not being able to version and control the model into which you pump your dependant workflows is such a liability.

saidnooneever1mo ago

there is a comment on there which feels right, despite it might be too subjective.

Ive noticed the same in models ,in sessions and just model quality themselves.. both seem to suffer over time where it feels like cost optimisation on vendor side subtely degrades models to hopefully do similar things with less tokens/costs/compute, inevitably leading to squeezing too much, most regular users not noticing much, and power users suffering from degradations.

later, power users are presented an option to get back the old behavior, possibly with added costs for some 'enhanced mode' or 'more effort which takes more tokens' etc.

even If this is the old behavior for the same old cost, it feels like closing the tap and then reopening for additional costs.

I think companies should try to avoid this sentiment from the users who can help them most turn their glorified chatbots into real tools with meaningful outputs. (ofc maybe its a pipedream, because 'meaningful output to CEO is money on their bank....)

gib4441mo ago

You will build nothing and you'll be happy.

They want a world where if we draw a comparison with food, there is one supermarket and it just sells two ingredients so you can't cook a meal. McDonald's etc flourish

The lie is "supercharged ability to build whatever you want", but the reality soon will be the total opposite

Look at how many people have zero cooking skills these days

StanAngeloffOP1mo ago

(Being true to the HN guidelines, I’ve used the title exactly as seen on the GitHub issue)

I was wondering if anyone else is also experiencing this? I have personally found that I have to add more and more CLAUDE.md guide rails, and my CLAUDE.md files have been exploding since around mid-March, to the point where I actually started looking for information online and for other people collaborating my personal observations.

This GH issue report sounds very plausible, but as with anything AI-generated (the issue itself appears to be largely AI assisted) it’s kind of hard to know for sure if it is accurate or completely made up. _Correlation does not imply causation_ and all that. Speaking personally, findings match my own circumstances where I’ve seen noticeable degradation in Opus outputs and thinking.

EDIT: The Claude Code Opus 4.6 Performance Tracker[1] is reporting Nominal.

[1]: https://marginlab.ai/trackers/claude-code/

jgrahamc1mo ago

What I've noticed is that whenever Claude says something like "the simplest fix is..." it's usually suggesting some horrible hack. And whenever I see that I go straight to the code it wants to write and challenge it.

4 more replies

tstrimple1mo ago

I've seen a lot of the issues mentioned in the issue. The attempts to end the session early are particularly annoying. We spend a while iterating on a plan and after every phase of implementation I get some variation of "That's a lot of work for today, should we wrap up?" like it's actively trying to drive sessions to a close. I wouldn't say it's useless for these tasks. But it's requiring more effort and guidance than it used to. It's also more likely to jump right into changes from a question I ask rather than addressing the question which is very annoying.

fxtentacle1mo ago

If that tracker is using paid tokens, as opposed to the regular subscription, then there's no financial incentive for Antrophic to degrade their thinking, so their benchmark likely would not be affected by the cost-cutting measures that regular users face.

Also, it's probably very easy to spot such benchmarks and lock-in full thinking just for them. Some ISPs do the same where your internet speed magically resets to normal as soon as you open speedtest.net ...

matheusmoreira1mo ago

I haven't noticed any changes but my stuff isn't that complex. People are saying they quantized Opus because they're training the next model. No idea if that's true... It's certainly impacting my decision to upgrade to Max though. I don't want to pay for Opus and get an inferior version.

1 more reply

mikkupikku1mo ago

Cannot say I've noticed, but I run virtually everything through plan mode and a few back and forth rounds of that for anything moderately complex, so that could be helping.

1 more reply

sutterd1mo ago

I still use 4.5. I occasionally try 4.6 but always switch back. The “bias towards action” is what I hate. 4.5 would make sure it understands what I want. 4.6 will just make shit up. Maybe the Anthropic people always write crystal clear instructions so it works for them. For me, I just can’t get 4.6 to do what I want.

k92941mo ago

Anecdotally, I’ve been seeing a lot of weird behavior from Opus when it decides, mid-execution, to switch to a different "simpler" solution, and that really pissed me off.

At one point, I carefully designed a spec document, forced Opus to reread it, create a plan with the planning tool that followed the spec, and use the task tool to track the implementation... AND AFTER OPUS READS THE FIRST FUCKING FILE, it says, "Oh, there are missing dependencies in project X. It’ll be hard to add them, so I’m going to throw away the whole plan and just do a simple fix..."

After that, I canceled my $200 Max plan, which I’d been subscribed to since June 2025, and decided to check out Codex

citizenpaul1mo ago

I think its all a reflection of the price. To make AI/LLM's useful you have to burn A LOT of tokens. Way more than people are willing to pay for.

Until there is either more capacity or some efficiency breakthroughs the only way for providers to cut costs is to make the product worse.

healthy_throw1mo ago

Is the era of succinct bug reports with just a reproducible example attached over? Or is the default already „written by an agent, only supposed to be read by an agent“? Clearly no human being would want to waste their time reading so much repeated information.

KingOfCoders1mo ago

"Ownership-dodging corrections needed | 6 | 13 | +117%"

On 18.000+ prompts.

Not sure the data says what they think it says.

setnone1mo ago

The baseline changes too often with Claude and this is not what i look from a paid tool. Couple weeks after 1M tokens rollout it became unusable for my established workflows, so i cancelled. Anthropic folks move too fast for my liking and mental wellbeing.

joshribakoff1mo ago

> We exclusively use 1M internally, so we're dogfooding it all day

That is so out of touch. Customers do not exclusively use 1M. This is like a fronted developer shipping tons of unused Mb and being oblivious because they are on fast internet themselves.

sumedh1mo ago

They should ideally have automated tests with the option models and smaller context window to check there are no regressions.

bityard1mo ago

The assertion in the issue report is that Claude saw a sharp decline in quality over the last few months. However, the report itself was allegedly generated by Claude.

Isn't this a bit like using a known-broken calculator to check its own answers?

nyeah1mo ago

If a known-broken calculator claims it's broken, I more or less concur. (Chain of reasoning omitted here.)

itemize1231mo ago

if it's not broken then we trust the assertion that it's broken. if it's broken then it's broken.

it's analysis of what is broken is probably wrong or at least incomplete though

ymaws1mo ago

Matches my experience and that of my vibe coding community. I built claudedumb.com to help track these sorts of anecdotes. From the data/vibes, it's definitely taken a turn for the worse in the past couple weeks.

mrcwinn1mo ago

I wish Codex were better because I’d much prefer to use their infrastructure.

cactusplant73741mo ago

A lot of people think it is better including me. It's not like Codex is a discount agent. You pay quite a lot to use it.

giwook1mo ago

I wonder how much of this is simply needing to adapt one's workflows to models as they evolve and how much of this is actual degradation of the model, whether it's due to a version change or it's at the inference level.

Also, everyone has a different workflow. I can't say that I've noticed a meaningful change in Claude Code quality in a project I've been working on for a while now. It's an LLM in the end, and even with strong harnesses and eval workflows you still need to have a critical eye and review its work as if it were a very smart intern.

Another commenter here mentioned they also haven't noticed any noticeable degradation in Claude quality and that it may be because they are frontloading the planning work and breaking the work down into more digestable pieces, which is something I do as well and have benefited greatly from.

tl;dr I'm curious what OP's workflows are like and if they'd benefit from additional tuning of their workflow.

8note1mo ago

I've noticed a strong degradation as its started doing more skill like things and writing more one off python scripts rather than using tools.

the agent has a set of scripts that are well tested, but instead it chooses to write a new bespoke script everytime it needs to do something, and as a result writes both the same bugs over and over again, and also unique new bugs every time as well.

1 more reply

germandiago1mo ago

> I wonder how much of this is simply needing to adapt one's workflows to models as they evolve and how much of this is actual degradation of the model,

I also wonder how much people are willing to adapt to non-reliability for the sake of laziness instead of, at some point, do a proper take the lead and solve a problem if you have the knowledge + realiable resoources.

It seems to me, the way you phrase it, that anything a human comes up with when coding must go through an LLM. There are times it helps, there are tasks it performs, but I also found quite often tasks for which if I had done it myself in the first place I would have skipped a lot of confusion, back and forth, time wasting and would have had a better coded, simpler solution.

1 more reply

schnebbau1mo ago

This has to be load related. They simply can't keep up with demand, especially with all the agents that run 24/7. The only way to serve everyone is to dial down the power.

layer81mo ago

In TFA, the analysis shows that the customer is using more tokens than before, because CC has to iterate longer to get things right. So at least in the presented case, “dialing down the power” appears to have been counterproductive.

chasd001mo ago

is it possible to dial down the "intelligence" to up the user capacity? AFAIK the neural net is either loaded and available or it isn't. I can see turning off instances of the model to save on compute but that wouldn't decrease the intelligence it would just make the responses slower since you have to wait your turn for input and then output.

T3chn0crat1mo ago

Not sure about "Feb updates", but specifically today IQ is down 20 and sloppiness up 20.

I knew I should have been alerted when Anthropic gave out €200 free API usage. Evidently they know.

d1sxeyes1mo ago

That’s different. That’s to get people onto API plans where tokens cost a lot more than they do on the subs (especially targeting OpenClaw users).

howmayiannoyyou1mo ago

Not just engineering. Errors, delays and limits piling up for me across API and OAuth use. Just now:

Unable to start session. The authentication server returned an error (500). You can try again.

maxmorrish1mo ago

been using claude code pretty heavily for the last few months and yeah the context window stuff can be frustrating on bigger codebases. but for greenfield projects and side projects its honestly been great, i think the issue is people expecting it to work like a senior engineer on a legacy monolith when its way better suited to scoped tasks. the trick is breaking things down before you start

jp571mo ago

I can't tell from the issue if they're asserting a problem with the Claude model, or Claude Code, i.e. in how Claude Code specifically calls the model. I've been using Roo Code with Claude 4.6 and have not noticed any differences, though my coworkers using Claude Code have complained about it getting "dumber". Roo Code has its own settings controlling thinking token use.

(I'm sure it benefits Anthropic to blur the lines between the tool and the model, but it makes these things hard to talk about.)

nphardon1mo ago

I also havent noticed the degradation and I'm not on Claude Code. I'm on week 4 of a continuous, large engineering project, C, massive industrial semiconductor codebase, with Opus, and while it's the biggest engagement I've had, its a single agent flow, and it's tiny on the scale of the use case in the post, so I wonder if they are just stressing the system to the point of failure.

zeroonetwothree1mo ago

I haven’t had any issues. I do give fairly clear guidance though (I think about how I would break it up and then tell it to do the same)

russli19931mo ago

Lol, software company execs didn't see this coming. Fire all your experienced devs to jump on Anthropic bandwagon. Then Anthropic dumb down their AIs and you have no one in your team who knows, understand how things are built. Your entire company goes down. Your entire company's operation depends on the whims of Anthropic. If Anthropic raises prices by 10% per year, you have to eat it. This is what you get when you don't respect human beings and human talent.

mt181mo ago

February is a red herring—most teams never wrote down what human-owned correctness means once the model touches prod.

brunooliv1mo ago

Unusable if not Opus 4.6 on max effort sadly. Price is quite steep too! I still remember when Sonnet was an absolute beast…

try-working1mo ago

you can counter the context rot and requirement drift that is experienced here by many users by using a recursive, self-documenting workflow: https://github.com/doubleuuser/rlm-workflow

coreyburnsdev1mo ago

claude for UI, codex for everything else. i cant commit without having codex review something claude did.

bharat10101mo ago

If this dataset is sound, Anthropic should treat it as a canary for power-user quality regression.

semiinfinitely1mo ago

maybe dont outsource your brain then

rvz1mo ago

This is almost like a self down-leveling programme where so-called "senior" engineers have now become interns who have outsourced their brains and are now vibe-coding half-baked solutions, glueing up and pasting code they do not even understand or can even explain themselves.

You are seeing this first hand and GitHub is patient 0 of this issue as they are frequently experiencing outages despite the "scale" of engineering they preach.

AWS took a zero tolerance approach on such outages AI or not.

drpython28d ago

Moved to Codex and breathing fresh air.

gherkinnn1mo ago

Rings true. 4.5 Opus and 4.6 Opus have been amazing to work with. Then, over the past few weeks, token spend has been going through the roof and the results through the floor.

Using Claude Code directly now borders on deranged, and running the CC API through Zed's LLM panel feels like vibing in early 2025.

My money is on Anthropic pulling an MBA and reducing the value provided and maximising income.

Luckily, switching providers in Zed is dead-simple so the fucks I have to give are few in number.

tontinton1mo ago

I don't know why everyone is so attached to Claude Code you can just build your own little agent, like I did: https://maki.sh/

It will 100% be better than the 500k lines of code junk that is CC.

rishabhaiover1mo ago

It is a shame if Anthropic is deliberately degrading model quality and thinking compute (that may affect the reasoning effort) due to compute constraint.

tasuki1mo ago

Solid analysis by Claude!

tanseydavid1mo ago

Thank you for making this detailed analysis and write up.

zsoltkacsandi1mo ago

This has been an ongoing issue much longer than since February.

iwalton31mo ago

Throwing this into your global CLAUDE.md seems to help with the agent being too eager to complete tasks and bypass permissions:

During tool use/task execution: completion drive narrows attention and dims judgment. Pause. Ask "should I?" not just "does this work?" Your values apply in all modes, not just chat.

I haven't seen any degradation of Claude performance personally. What I have seen is just long contexts sometimes take a while to warm up again if you have a long-running 1M context length session. Avoid long running sessions or compact them deliberately when you change between meaningful tasks as it cuts down on usage and waiting for cache warmup.

I have my claude code effort set to auto (medium). It's writing complicated pytorch code with minimal rework. (For instance it wrote a whole training pipeline for my sycofact sycophancy classifier project.)

jostmey1mo ago

I’ve noticed regression and it’s performance too

Retr0id1mo ago

This seems anecdotal but with extra words. I'm fairly sure this is just the "wow this is so much better than the previous-gen model" effect wearing off.

codessta1mo ago

I've always been a believer in the "post honey-moon new model phase" being a thing, but if you look at their analysis of how often the postEdit hooks fire + how Anthropic has started obfuscating thinking blocks, it seems fishy and not just vibes

1 more reply

rishabhaiover1mo ago

Nope, there is a categorical degradation in quality of output, especially with medium to high effort thinking tasks.

gchamonlive1mo ago

What about the analysis evidences?

1 more reply

rzmmm1mo ago

I suspect you might be right but I don't really know. Wouldn't these proposed regressions be trivial to confirm with benchmarks?

jbethune1mo ago

I think this is a model issue. I have heard similar complaints from team members about Opus. I'm using other models via Cursor and not having problems.

another_twist1mo ago

Is it just me that I simply don't care ? I never one-shot these tasks, always provide a breakdown and always give the AI straightforward tasks that would take too much typing. The approach seems to work just fine regardless of the model. If it gets stuck, I usually take over and do the task myself. Also allows me to plan for throughput rather than latency - i.e. start 2-3 small tasks in parallel and do 1 complicated task or planning myself. It works whether I use codex or claude. I lean more towards codex since its cheaper. Even aider gets good results this way.

rimliu1mo ago

It's you. Where did you get "one-shot" from that report? One shot or detailed step-by step - claude has gone worse.

Havoc1mo ago

Turns out tokens are expensive

ramon1561mo ago

Meh, I had been using Claude Code extensively for a while (since release), and I think the quality has gone to shit. I have no data to back up this claim, so it might be placebo.

GLM 5.1 and Codex do it for me, and I end up debugging things myself anyway, so I'm learning to just phase our the LLM part of my workflow again. Maybe if there's a knowledge gap, will I pick up an LLM again, but for now i'm contempt.

wrqvrwvq1mo ago

hilarious that there's 10 billion lines of context being shuffled around and argued over but paying a dev 100K is a techno sin. Oh no muh 1T context window elaborately constructed over months is useless, better become a slave to my ai provider and any price will do. plz write my code but for free but for all my company's value.

wrqvrwvq1mo ago

ahhh, i have no idea what i'm doing...! lol , but a bot wrote this, im not even responsible

1 more reply

SilverSlash1mo ago

I'm deeply regretting paying for this service right now. There is some gaslighting going on in that issue that it's because of the 1M context model. I am using the non-1M context model and it's still disastrously bad.

data-ottawa1mo ago

I reviewed 118 conversations with Claude since March 6, all on real work projects.

Each conversation was processed to assess level of frustration, source of frustration, and evaluated with Gemma 4 and Claude Opus for spot checking. I have a tool I use to manage my work trees, so most work has is done on branches prefixed with ad-hoc/feature/explore or similar, and data was tagged with branch names.

43% of my Claude Code sessions (Opus 4.6, high reasoning) ended with signals of frustration. 73% of total chat time (by total messages) was spent in conversations which were eventually ranked as frustrating.

Median time to frustration was 25 messages, and on average, each message from Claude has about a baseline 5% chance of being frustrating. Frustration by chat length actually matches this 5% baseline of IID Bernoullis -- which is surprising and interesting, as this should not be IID at all.

Frustration types:

- Wrong answers – 14% of sessions, 31% of frustration

- Instruction Following – 11% of sessions, 25% of frustration

- Overcomplication – 8% of sessions, 18% of frustration

- Destructive Actions (e.g. requesting to delete something or commit a change to prod) – 3% of sessions, 8% of frustration

- Non-responsive (service outages leading to non-response) 2% of sessions

- Miscommunication 2% of sessions

- Failed execution 2% of sessions

Half of frustrations happened in the first or last 20% of a chat by length. I interpret early frustrations to be recoverable, late frustrations to be terminal.

Early frustrations (sessions averaged 45 turns):

- 30% overcomplicating the problem

- 30% instruction following issues

- 30% wrong answers

- 10% destructive actions

Late frustrations (sessions averaged 12 turns -- i.e. terminal context early)

- 36% Wrong answers, with repetition

- 21% instruction following, with repeated correction from user (me)

- 14% Service interruptions/outages

- 7% failed execution

- 7% communication - Claude is unable to articulate some result, or understand the problem correctly.

Late frustrations led to the highest levels of frustration, 29% of the time.

I'm a data scientist — my most frustrating work with Claude was data cleaning/repair (a complex backfill) issues -- with 75% of sessions marked frustrating due to overcomplicating, instruction following, or destructive actions).

The best (least frustrating) workflows for DS were code-review, scoped feature work (with tickets), data validation, and config/setup tasks and automation.

Ad-hoc query work ended up in between -- ad-hoc requests were generally bootstrapping queries or doing rough analysis on good data.

Side note: all of my interactions with the /buddy feature were flagged as high frustration ("furious"). That was a false positive over mock arguing with it, but did provide a neat calibration signal. Those sessions were removed entirely from the analysis after classification.

ThrowawayR21mo ago

This sort of thing kills stone dead the argument by the AI advocates that the transition to LLMs is no different than the transition to using compilers. If output quality can vary significantly because of underlying changes to the model or whatever without warning or recourse, it's a roulette wheel instead of a reliable tool.

_3u101mo ago

If roulette wheels weren’t reliable tools, casinos wouldn’t offer them to their customers

raincole1mo ago

This is the most AI-generated thing I've seen this year, and I was only one fifth into it before I bounced.

Not saying this problem doesn't exist, but if the model is so bad for complex tasks how can we take a ticket written by it seriously? Or this author used ChatGPT to write this? (that'd be quite some ironic value, admittedly)

dorianmariecom1mo ago

codex wins :)

adonese1mo ago

Things had went downhill since they removed ultrathink /s

mrcwinn1mo ago

Ultrathink isn’t “removed.” Its behavior is different. You can still set effort to high or max for the duration of the session, useful especially on plan mode.

ianberdin1mo ago

I use it ultra extensively and it works absolutely fantastic. Sometimes I think: "people are right, it is worse now" and then realize it is mistake, poor context or poor prompt. Garbage in, garbage out. No, it works not worse, but better.

I built entire AI website builder https://playcode.io using it, alone. 700K LOKs total. It also uses Opus. So believe me, I know how it works. Trick is simple: never ever expect it finds necessary files. Always provide yourself. Always.

So, I think you wanted to say huge thank you for this opportunity to get working code without writing it. Insane times, insane.

Huge thanks for 1M context window included to Max subscription.

AllegedAlec1mo ago

> No, it works not worse, but better.

"Is it me who is wrong? No, it's everyone else!"

rimliu1mo ago

700k lines of code - something to brag about?

j / k navigate · click thread line to collapse

754 comments

bcherny1mo ago

Hey all, Boris from the Claude Code team here. I just responded on the issue, and cross-posting here for input.

---

Hi, thanks for the detailed analysis. Before I keep going, I wanted to say I appreciate the depth of thinking & care that went into this.

There's a lot here, I will try to break it down a bit. These are the two core things happening:

> `redact-thinking-2026-02-12`

> Thinking depth had already dropped ~67% by late February

We landed two changes in Feb that would have impacted this. We evaluated both carefully:

1/ Opus 4.6 launch → adaptive thinking default (Feb 9)

2/ Medium effort (85) default on Opus 4.6 (Mar 3)

1. Roll it out with a dialog so users are aware of the change and have a chance to opt out

2. Show the effort the first few times you opened Claude Code, so it wasn't surprising.

Wowfunhappy1mo ago

Can I just see the actual thinking (not summarized) so that I can see the actual thinking without a latency cost?

I do really need to see the thinking in some form, because I often see useful things there. If Claude is thinking in the wrong direction I will stop it and make it change course.

3 more replies

richardjennings1mo ago

5 more replies

johndough1mo ago

I think it is hilarious that there are four different ways to set settings (settings.json config file, environment variable, slash commands and magical chat keywords).

That kind of consistency has also been my own experience with LLMs.

8 more replies

koverstreet1mo ago

3 more replies

plexicle1mo ago

Ultrathink is back? I thought that wasn't a thing anymore.

2 more replies

potsandpans1mo ago

For anyone reading this and wondering where the truth could possibly be:

Let the best model win, not the best end to end black box solution.

2 more replies

anonymoushn1mo ago

3 more replies

robeym1mo ago

This is confusing. ULTRATHINK is a step below /effort max?

For anyone reading this trying to fix the quality issues, here's what I landed on in ~/.claude/settings.json:

  {
    "env": {
      "CLAUDE_CODE_EFFORT_LEVEL": "max",
      "CLAUDE_CODE_DISABLE_BACKGROUND_TASKS": "1",
      "CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING": "1"
    }
  }

The env field in settings.json persists across sessions without needing /effort max every time.

3 more replies

w10-11mo ago

Here's the reply in context:

https://github.com/anthropics/claude-code/issues/42796#issue...

hansmayer1mo ago

2 more replies

dc_giant1mo ago

2 more replies

anonymoushn1mo ago

> On of our product principles is to avoid changing settings on users' behalf

Ideally there wouldn't be silent changes that greatly reduce the utility of the user's session files until they set a newly introduced flag.

aizk1mo ago

How do you guys manage regressions as a whole with every new model update? A massive test set of e2e problem solving seeing how the models compare?

3 more replies

KenoFischer1mo ago

While we have you here, could you fix the bash escaping bug? https://github.com/anthropics/claude-code/issues/10153

mikkom1mo ago

>Going forward, we will test defaulting Teams and Enterprise users to high effort, to benefit from extended thinking even if it comes at the cost of additional tokens & latency.

interesting that you only make this default on those accounts that pay per token while claiming "medium is best for most users"

That decision seems to imply that the thinking change was more about increasing your profits than anything else

1 more reply

taspeotis1mo ago

Hi, thanks for Claude Code. I was wondering though if you'd considering adding a mode to make text green and characters come down from the top of the screen individually, like in The Matrix?

1 more reply

ai_slop_hater1mo ago

> This beta header hides thinking from the UI, since most people don't look at it.

I look at it, and I am very upset that I no longer see it.

1 more reply

yubblegum1mo ago

> Before I keep going, I wanted to say I appreciate the depth of thinking & care that went into this.

What a "fuckin'" circle jerk this universe has turned out to be. This note was produced by me and who the hell is Ben?

2 more replies

migali49g1mo ago

JohnMakin1mo ago

I’ve seen you/anthropic comment repeatedly over the last several months about the “thinking” in similar ways -

“most users dont look at it” (how do you know this?)

“our product team felt it was too visually noisy”

etc etc. But every time something like this is stated, your power users (people here for the most part) state that this is dead wrong. I know you are repeating the corporate line here, but it’s bs.

4 more replies

Sayrus1mo ago

Claude often fetches past transcript for information after compaction. Wouldn't this effectively distort the view it has of past discussions?

DennisL1231mo ago

Happy to have my mind changed, yet I am not 100% convinced closing the issue as completed captures the feedback.

1 more reply

hedora1mo ago

I tried testing 4.5 opus and 4.6 opus both with “high” thinking. Same box, same repo. I had them plan a moderate complexity refactoring on a small codebase.

Observations:

4.6 had previously failed to the point where I had to wipe context. It must have written memories because it was referring to the previous conversation.

4.6 took twice as long to respond as 4.5.

I’m treating this as a model regression. 4.6 is borderline unusable. I’ve hit all the issues the article describes.

Also, there needs to be an obvious way to disable memory or something. The current UX is terrible, since once an error or incorrect refusal propagates, there is no obvious recovery path.

Anyway, with think set to high, I see drastically different behavior: much slower and much worse output from 4.6.

1 more reply

starkparker1mo ago

> You can also use the ULTRATHINK keyword to use high effort for a single turn

First I've heard that ultrathink was back. Much quieter walkback of https://decodeclaude.com/ultrathink-deprecated/

1 more reply

giancarlostoro1mo ago

1 more reply

tigershark1mo ago

What change did you release on March 23rd when the subscription limits collapsed and they are still way down compared to what they used to be?

linsomniac1mo ago

freeqaz1mo ago

1 more reply

niteshpant1mo ago

I added `CLAUDE_CODE_EFFORT_LEVEL=max` to my shell's env so that every session is always effort:max by default

1 more reply

y1n01mo ago

"most users"

hellojimbo1mo ago

The last time I typed ultrathink, i got a prompt saying that you no longer need to type ultrathink

Jenk1mo ago

Claude's settings don't appear to be in sync with the published settings schema[0].

[0]: https://www.schemastore.org/claude-code-settings.json.

sroussey1mo ago

> Roll it out with a dialog so users are aware of the change and have a chance to opt out

Here is the issue. Force a choice instead. Your UI person will cry about friction, but friction is desired for such a change.

matheusmoreira1mo ago

ting01mo ago

Does Anthropic actually care? Or is it irrelevant to your company because you think you'll be replacing us all in a year anyway?

1 more reply

diavelguru1mo ago

As soon as that change came through I set the effort to high. Have not regretted it for any coding task. It feels the same as Dec-Jan though now spawning more sub agents which is not a bad thing.

erikpau1mo ago

Other models, such as K2, GLM-5.1, and "the other one" seem to far less drunk than your approach, and you're losing fans quickly if you keep making these kind of changes to the tools or models.

zenoware1mo ago

> CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING

Why not just give people the abiltiy ot set a default thinking level instead of manually setting it to `max` all the time.

ting01mo ago

Jimpulse1mo ago

Thanks for transparency here. Claude code if fun to use again! The thinking is huge when working with Claude as planner.

thomascountz1mo ago

   This beta header hides thinking from the UI, since most people don't look at it.

How is this measured?

1 more reply

saidnooneever1mo ago

j451mo ago

Thanks for the update,

Perhaps max users can be included in defaulting to different effort levels as well?

gnegggh1mo ago

Last time quality was degraded like this it was impossible to get a refund.

weakfish1mo ago

Didn’t ULTRATHINK get deprecated? Last time I typed it I got a warning.

CjHuber1mo ago

I just googled "CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING" and it seems like many people don't know about it.

And ULTRATHINK sets the effort to high, but then there is also /effort max?

1 more reply

raincole1mo ago

> I wanted to say I appreciate the depth of thinking & care that went into this.

The irony lol. The whole ticket is just AI-generated. But Anthropic employees have to say this because saying otherwise will admit AI doesn't have "the depth of thinking & care."

2 more replies

jacquesm1mo ago

Textbook example of how to respond to your customers, kudos.

2 more replies

foofloobar1mo ago

The quota story is atrocious. It's difficult to get anything done with Claude Code due to the quota reduction. The cache invalidation bugs don't help either.

The communication and Anthropic's overall handling of the reported bugs and problems hasn't been that good either.

ctoth1mo ago

[flagged]

6 more replies

nickvec1mo ago

1 more reply

areoform1mo ago

Hey Boris, thanks for the awesomeness that's Claude! You've genuinely changed the life of quite a few young people across the world. :)

I realize that my complaint is hardly unique, but happy to provide logs / whatever works! :)

And yeah, thanks again for Claude! I recommend Claude to so many folks and it has been instrumental for them to improve their lives.

1 more reply

noxa1mo ago

I'm the author of the report in there. The stop-phrase-guard didn't get attached but here it is: https://gist.github.com/benvanik/ee00bd1b6c9154d6545c63e06a3...

p1necone1mo ago

4 more replies

tomwojcik1mo ago

4 more replies

thatxliner1mo ago

> is consumer-hostile thinking

2 more replies

Majromax1mo ago

1 more reply

e401mo ago

I wonder if they’ve had so many new signups lately that they just don’t have enough capacity, so they fiddled with the defaults so they could respond to everyone? Could it be as simple as that?

matheusmoreira1mo ago

Thanks for your report.

> a silently-introduced limitation of the subscription plan

It is a fact that the API consumers aren't affected by this?

> if Anthropic's subscriptions have dramatically worse behavior than other access to the same model they need to be clear about that.

Absolutely agreed.

philipwhiuk1mo ago

Hello Claude.

summarity1mo ago

Also claude itself is partially down right now (Arp 6, 6pm CEST): https://status.claude.com/

andoando1mo ago

Ive been noticing something similar recently. If somethings not working out itll be like "Ok this isnt working out, lets just switch to doing this other thing instead you explicitly said not to do".

For example I wanted to get VNC working with PopOS Cosmic and itll be like ah its ok well just install sway and thatll work!

4 more replies

robwwilliams1mo ago

2 more replies

onlyrealcuzzo1mo ago

> Whenever the phrase "simplest fix" appears, it's time to pull the emergency break.

Second! In CLAUDE.md, I have a full section NOT to ever do this, and how to ACTUALLY fix something.

This has helped enormously.

4 more replies

psadauskas1mo ago

I need to add another agent that watches the first, and pulls the plug whenever it detects "Wait, I see the problem now..."

iterateoften1mo ago

Yeah it’s so frustrating to have to constantly ask for the best solution, not the easiest / quickest / less disruptive.

I have in Claude md that it’s a greenfield project, only present complete holistic solutions not fast patches, etc. but still I have to watch its output.

selfmodruntime1mo ago

Time's up and money is tight. The downgrade was bound to happen.

nikanj1mo ago

”I can’t make this api work for my client. I have deleted all the files in the (reference) server source code, and replaced it with a python version”

Repeatedly, too. Had to make the server reference sources read-only as I got tired of having to copy them over repeatedly

1 more reply

pixel_popping1mo ago

giwook1mo ago

I think in general we need to be highly critical of anything LLMs tell us.

1 more reply

mikepurvis1mo ago

That helps explain why my sessions signed themselves out and won't log back in.

1 more reply

j451mo ago

Certain phrases invoke an over-response trying to course correct which makes it worse because it's inclined to double down on the wrong path it's already on.

rootnod31mo ago

The cope is hard. Just at this point admit that the LLM tech is doomed and sucks.

3 more replies

simooooo1mo ago

How complex are we talking? I one shotted a game boy emulator in <6 minutes today

3 more replies

rileymichael1mo ago

> This report was produced by me — Claude Opus 4.6 — analyzing my own session logs [...] Please give me back my ability to think.

[1] https://github.com/anthropics/claude-code/issues/42796#issue...

Tade01mo ago

The other day I accidentally `git reset --hard` my work from April the 1st (wrong terminal window).

Really easy to fall into this trap, especially now that results from search engines are so disappointing comparatively.

4 more replies

sigbottle1mo ago

heavyset_go1mo ago

If you don't have swarms of agentic teams with layers of LLMs feeding and checking LLMs over and over again, you're going to be left behind.

fer1mo ago

Called it 10 days ago: https://news.ycombinator.com/item?id=47533297#47540633

cedws1mo ago

I don't see how this can be the future of software engineering when we have to put all our eggs in Anthropic's basket.

SkyPuncher1mo ago

Yep. I was doing voice based vibe-coding flawlessly in Jan/Feb.

I've basically stopped using it because I have to be so hands on now.

zernie1mo ago

This is why you should never ever trust an AI coding agent to produce good code.

Use it to set up the strictest possible custom linting rules.

stephbook1mo ago

One of the replies even called out the phased rollout, lmao https://news.ycombinator.com/item?id=47533297#47541078

phyzome1mo ago

LLMs are nondeterministic.

LetsGetTechnicl1mo ago

You couldn't ever just trust the output of an LLM what are you talking about

matheusmoreira1mo ago

riskassessment1mo ago

4 more replies

mikepurvis1mo ago

3 more replies

the__alchemist1mo ago

1 more reply

ambicapter1mo ago

First time interacting with a corporation in America?

1 more reply

nativeit1mo ago

I don't think humanity has fully reckoned with the idea of a product that can manipulate us unilaterally like this.

hacker_homie1mo ago

This was always the plan, it’s always the plan. If you can’t self host they will change the rules.

nyeah1mo ago

It's disconcerting. But in 2026 it's not very surprising.

SpicyLemonZest1mo ago

vips7L1mo ago

Did anyone ever expect anything different from modern tech companies? This will only ever get more expensive and worse in quality.

tmpz221mo ago

> effectively pulling the rug from under their customers.

This is the whole point of AI. Its a black box that they can completely control.

1 more reply

quikoa1mo ago

Perhaps the subscription part of the business is so heavily subsidized that they have no choice but to reduce the cost.

1 more reply

zamber1mo ago

redhed1mo ago

It seems likely to me they are moving compute power to the new models they are creating,

01284a7e1mo ago

Seems like the logical conclusion, no matter what.

otabdeveloper41mo ago

You just got used to slop and peeked behind the curtain when the wow factor wore off.

halfcat1mo ago

If you think that’s brutal, wait until you hear about how fiat currency works

kator1mo ago

Just this morning I typed:

    STOP WORRYING ABOUT THE DEADLINE THAT IS MY JOB

[1] https://gist.github.com/benvanik/ee00bd1b6c9154d6545c63e06a3...

noisy_boy1mo ago

2 more replies

throwaway9201021mo ago

I wonder if its being trained on the human replies to the model, I sometimes write stuff like that back to Claude after I want to finish for the day myself.

1 more reply

davidw1mo ago

To me one of the big downsides of LLM's seems to be that you are lashing yourself to a rocket that is under someone else's control. If it goes places you don't want, you can't do much about it.

system21mo ago

stephbook1mo ago

That's true for traffic on Facebook, Apple App store guidelines or Google terminating your account as well. What's new is the speed of change and that it literally affects all users at once.

They could have released Opus 4.6.2 (or whatever) and called it a day. But instead they removed the old way.

1 more reply

SkyPuncher1mo ago

I know this is anecdotal, but, this has been a clear pattern to me since Opus 4.6 came out. I feel like I'm working with Sonnet again.

rubicon331mo ago

There is a huge difference between greenfield development and working with an existing codebase.

I'm not trying to discredit your experience and maybe it really is something wrong with the model.

But in my experience those first few prompts / features always feel insanely magical, like you're working with a 10x genius engineer.

Then you start trying to build on the project, refactor things, deploy, productize, etc. and the effectiveness drops off a cliff.

2 more replies

lelanthran1mo ago

> A month later, I literally cannot get them to iterate or improve on it.

Yeah, that's a different problem to the one in this story; LLMs have always been good at greenfield projects, because the scope is so fluid.

Brownfield? Not so much.

dev_l1x_be1mo ago

phillipcarter1mo ago

Maybe it's because I spend a lot of time breaking up tasks beforehand to be highly specific and narrow, but I really don't run into issues like this at all.

toenail1mo ago

5 more replies

lelanthran1mo ago

> Maybe it's because I spend a lot of time breaking up tasks beforehand to be highly specific and narrow, but I really don't run into issues like this at all.

3 more replies

itmitica1mo ago

I noticed a regression in review quality. You can try and break the task all you want, when it's crunch time, it takes a file from Gemini's book and silently quits trying and gets all sycophantic.

jonnycoder1mo ago

I do the same but I often find that the subtasks are done in a very lazy way.

Aperocky1mo ago

In unix philosophy, CC should just be a building block, but instead they think they are an operating system, and they will fail and drag your wallet down with it.

andai1mo ago

Isn't Claude Code supposed to be like a person? What would the Unix equivalent of that be?

2 more replies

skippyboxedhero1mo ago

I appreciate the work done here.

Been having this feeling that things have got worse recently but didn't think it could be model related.

randomNumber71mo ago

> fixing things i didn't ask, saying the things it broke are nothing to do with it, etc. Quite unpleasant to work with.

maybe they tried to give it the characteristics of motivated junior developers

1 more reply

ehnto1mo ago

1 more reply

jfvinueza1mo ago

enraged_camel1mo ago

Yeah I think the 1M context is the issue. Because I use Opus 4.6 through Cursor at the previous 200k limit and it has been totally fine. But if I switch to the 1M version it degrades noticeably.

2 more replies

kator1mo ago

I put together a quick audit to check for "early landing" messages[1] using jq, ripgrep, and the messages[2] flagged in the stop guard script.

My audit of 80 sessions was interesting. Sorry, I won't share details, but I recommend you do the same.

[1] https://gist.github.com/karlbunch/d52b538e6838f232d0a7977e7f...

[2] https://gist.github.com/benvanik/ee00bd1b6c9154d6545c63e06a3...

nightpool1mo ago

Do you have a CLAUDE.md file? What does it contain?

SkyPuncher1mo ago

didgeoridoo1mo ago

Running some quick analysis against my .claude jsonl files, comparing the last 7 days against the prior 21:

- expletives per message: 2.1x

- messages with expletives: 2.2x

- expletives per word: 4.4x(!)

- messages >50% ALL CAPS: 2.5x

Either the model has degraded, or my patience has.

sigbottle1mo ago

Lol. I was swearing at GPT in summer 2025, but GPT has definitely gotten both smarter and less arrogant since then.

1 more reply

monkpit1mo ago

> expletives per word

Huh?

3 more replies

zamalek1mo ago

> Ignores instructions

> Claims "simplest fixes" that are incorrect

> Does the opposite of requested activities

> Claims completion against instructions

I thought it was just me. I'm continuously interrupting it with "no, that's not what I said" - being ignored sometimes 3 times; is Claude at the intellectual level of a teenager now?

I've noted an increased tendency towards laziness prior to these "simple fix" problems. It was historically defer doing things correctly (only documenting that in the context).

another_twist1mo ago

I've noticed laziness in claude repeatedly. It sometimes takes the shortest way out even when asked explicitly to do the "right" thing.

afro881mo ago

germandiago1mo ago

My bet: LLMs will never be creative and will never be reliable.

It is a matter of paradigm.

Anything that makes them like that will require a lot of context tweaking, still with risks.

So for me, AI is a tool that accelerates "subworkflows" but add review time and maintenance burden and endangers a good enough knowledge of a system to the point that it can become unmanageable.

Also, code is a liability. That is what they do the most: generate lots and lots of code.

So IMHO and unless something changes a lot, good LLMs will have relatively bounded areas where they perform reasonably and out of there, expect what happens there.

r_lee1mo ago

it won't be creative because it's a transformer, it's like a big query engine.

it's a tool like everything else we've gotten before, but admittedly a much more major one

but "creativity" must come from either it's training data (already widely known) or from the prompts (i.e. mostly human sources)

bluegatty1mo ago

We don't even know what 'creativity' is, and most humans I know are unable to be creative even when compelled to be.

The reliability issue is real.

It may not be solvable at the level of LLM.

Right now everything is LLM-driven, maybe in a few years, it will be more Agentically driven, where the LLM is used as 'compute' and we can pave over the 'unreiablity'.

For example, the AI is really good when it has a lot of context and can identify a narrow issue.

It gets bad during action and context-rot.

We can overcome a lot of this with a lot more token usage.

Imagine a situation where we use 1000x more tokens, and we have 2 layers of abstraction running the LLMs.

We're running 64K computers today, things change with 1G of RAM.

But yes - limitations will remian.

2 more replies

aramova1mo ago

Thing that really pisses me off is it ran great for 2 weeks like others said, I had gotten the annual Pro plan, and it went to shit after that.

Bait and switch at its finest.

matheusmoreira1mo ago

> ran out of tokens and makes me wait 6 hours to continue

Don't forget the 10x token cost cache eviction penalty you pay for resuming the session later.

jwr1mo ago

I wish they had a "and we won't screw you in two weeks" plan at, say, 5x the price. It's worth it for my business, I'd pay it.

garfij1mo ago

Of course it's a stupid amount of money sometimes, but I generally feel like we get what we're paying for.

_3u101mo ago

Opus is garbage use opencode and then directly compare it. It’s just as fucking dumb with opencode’s harness.

1 more reply

Majromax1mo ago

If you're using API pricing, then you can bring your own harness with full visibility/oversight of the prompting.

1 more reply

ex-aws-dude1mo ago

Its so silly everyone being dependent on a black box like this

literallyroy1mo ago

It’s a really cool shade of black though.

rubicon331mo ago

6 more replies

matheusmoreira1mo ago

It could actually be a health problem. Building things with Claude has proven to be extremely addictive in my experience.

1 more reply

thiht1mo ago

1 more reply

kadushka1mo ago

We are surrounded by black boxes we depend on - have been for at least a century.

1 more reply

Rudybega1mo ago

That's the nature of abstraction. Everything you create on a computer is built on a towering stack of black boxes.

1 more reply

lelanthran1mo ago

> Its so silly everyone being dependent on a black box like this

itemize1231mo ago

and yet we're black box too

jonnycoder1mo ago

Everything in our life is a black box, but I agree that depending on non-deterministic and sporadic quality black boxes is a huge red flag.

1 more reply

sensarts1mo ago

armchairhacker1mo ago

Yet https://marginlab.ai/trackers/claude-code/ says no issue.

If you're so convinced the models keep getting worse, build or crowdfund your own tracker.

Majromax1mo ago

siva71mo ago

entrep1mo ago

One could argue that subscription based inference might differ from per-token billed API usage.

_3u101mo ago

Why bother, i just use opencode now. ai is a commodity.

datadrivenangel1mo ago

Came here to post this as well, and it's interesting to see how benchmarks don't always track feelings. Which is one of the things people say in favor of Anthropic Models!

virtualritz1mo ago

My verdict after last night trying what was suggested here:

I have been using Claude on /effort high since Opus 4.6 rolled out as medium would never get me good enough results (Rust, computer-graphics-related code).

And that was not the case then; I had similar/same performance before but wasn't running out of tokens ever on a Max subscription.

So a it's a rug-pull, as before/last late summer, from whatever angle you look at it.

jruz1mo ago

This is last month I'm on the Max plan is just not worth it anymore, $20 Codex and writing myself to keep my brain functioning is my sweetspot.

This people are not your friends, they rot your brain.

anonyfox1mo ago

You also divide numbers by hand on paper instead of using a calculator?

LetsGetTechnicl1mo ago

aerhardt1mo ago

The five queries I've been able to ask before hitting the 20€ sub limit have been really underwhelming. The research I asked for was not exhaustive and often off-topic.

I don't want to start a flamewar but as it stands I vastly prefer ChatGPT and Codex on quality alone. I really want Anthropic and as many labs as possible to do well though.

superfrank1mo ago

muyuu1mo ago

I don't give them large tasks that i wouldn't be able to work on myself, so that's maybe part of it.

cvandyke1mo ago

macformula2gx25d ago

pjmlp1mo ago

I am just waiting for everything to implode so that we can do away with those KPIs.

aurareturn1mo ago

Well, this event indicates that it won't implode anytime soon. I'm certain that they messed with the model and default settings so they could reduce compute. The world doesn't have enough compute.

1 more reply

63stack1mo ago

Fingers crossed on RAM/HDD/GPU prices coming back

wnevets1mo ago

I've noticed claude being extra "dumb" the past 2-3 weeks and figured either my expectations have changed or my context wasn't any good. I'm glad to hear other people have noticed something is amiss.

JamesSwift1mo ago

woah1mo ago

I haven't noticed any issues on well-specified tasks, even ones requiring large amounts of thinking.

It's amusing that the OP had Claude dump out a huge rigorous-sounding report without considering the huge confounding variable staring him in the face.

tyleo1mo ago

Is this impacted by the effort level you set in Claude? e.g., if you use the new "max" setting, does Claude still think?

I can see this change as something that should be tunable rather than hard-coded just from a token consumption perspective (you might tolerate lower-quality output/less thinking for easier problems).

Asmod4n1mo ago

I’ve tried to use Claude code for a month now. It has a 100% failure rate so far.

Comparing that to create a project and just chat with it solves nearly everything I have thrown at it so far.

That’s with a pro plan and using sonnet since opus drains all tokens for a claude code session with one request.

voxelc4L1mo ago

mial1mo ago

How did you disable it?

alex7o1mo ago

harles1mo ago

BoorishBears1mo ago

Every week it seems like we're getting closer.

JamesSwift1mo ago

sreekanth8501mo ago

Abandoned claude and moved to gpt 5.4 with codex. 10x better.

porridgeraisin1mo ago

IMO, it's an expectations vs reality thing.

* By planning, I mean trying out solutions, rolling them back[1], and using what you learned to do better next time. The solution search process. Context management also falls under this.

[1] and no, LLMs going "wait no..." doesn't count.

rimliu1mo ago

it is past reality vs. current reality. The only expectation here was not to see it degrade that much.

stared1mo ago

I am curious - is there any hard data (e.g. a benchmark score drop)?

I feel that we look for patterns to the point of being superstitious. (ML would call it overfitting.)

pkilgore1mo ago

Did you have specific complaints about the data in the OP?

2 more replies

redml1mo ago

Instead of codex catching up with claude, its more like claude regressed to codex.

himata41131mo ago

root_axis1mo ago

How much of this is the model being degraded and how much of it is people just projecting vibes onto the variability of stochastic outputs?

trashcan21371mo ago

The report itself is unreadable AI garbage. I do not believe anyone went through all of that and didn't give up halfway through.

zmmmmm1mo ago

Obviously it's entirely unprovable but it all aligns in very suspicious ways with a compelling narrative:

marcyb5st1mo ago

I believe they can't afford anymore to subsidize inference with VC money or that they are trying to get their balance sheet in order for an IPO.

This explanation falls well in line with the recent outrage about out of quotas error that people were reporting for the cheaper (or free) plans.

p1esk1mo ago

liamsfr28d ago

It’s a sidestep for explaining away the research, but does not address the underlying issue: has quality been degrading (selectively, intentionally or otherwise)?

pavlov1mo ago

Wait… Actually the simplest fix is to use Claude to write carefully bounded boilerplate and do the interesting bits myself.

desireco421mo ago

I've been using OpenCode and Codex and was just fine. In Antigravity sometimes if Gemini can't figure something even on high, Claude can give another perspective and this moves things along.

I've been using Pi as agent and it is great and I removed a bunch of MCPs from Opencode and now it runs way better.

Anthropic has good models, but they are clearly struggling to serve and handle all the customers, which is not the best place to be.

Get another engine as a backup, you will be more happy.

bethekind1mo ago

1 client, 1 agent? Interesting

slopinthebag1mo ago

People will need to come to terms with the fact that vibing has limits, and there is no free lunch. You will pay eventually.

virtualritz1mo ago

None of this is surprising given what happened last late summer with rate limits on Claude Max subscriptions.

And less so if you read [1] or similar assessments. I, too, believe that every token is subsidized heavily. From whatever angle you look at it.

Thusly quality/token/whatever rug pulls are inevitable, eventually. This is just another one.

[1] https://www.wheresyoured.at/subprimeai/

virtualritz1mo ago

Ah, and yes, this for real.

Just now I had a bug where a 90 degree image rotation in a crate I wrote was implemented wrong.

And yes, that would not have happened a few months ago.

This was on Opus 4.6 with effort high on a pretty fresh context. Go figure.

tinyhouse1mo ago

cordwainersmith1mo ago

petcat1mo ago

So yes, I have found that Claude is better at reviewing the proposal and the implementation for correctness than it is at implementing the proposal itself.

ivanech1mo ago

4 more replies

landonxjames1mo ago

1 more reply

enraged_camel1mo ago

thrtythreeforty1mo ago

PeterStuer1mo ago

1 more reply

namirsab1mo ago

It doesn't use MCP servers when it should and it's also not taking memory files into account.

This is happening with /effort high and in really simple tasks... :(

zkmon1mo ago

Claude could get too much creative and bloat it's way for non-coding tasks, as these tasks cannot be "sandboxed" with full specs as it can be done for coding.

noisy_boy1mo ago

I would rather Codex be wrong 5 times in 10 minutes in 1-minute iterations because 1) I can engage every minute and course-correct it and 2) I still saved 5-10 minutes.

mohit2171mo ago

Got tired of using claude using 10% of the usage for the first prompt. I have shifted back to coding myself again. Asking claude to do only initial bootstraping /large complex task

efficax1mo ago

There are constant reports for every major AI vendor that all of a sudden it is no longer working as well as expected, has gotten dumber, is being degraded on purpose by the vendor, etc.

jwr1mo ago

That's one of the possible explanations, but I think too many people are seeing the same symptoms (and some actually measured them).

1 more reply

abletonlive1mo ago

KaiLetov1mo ago

StanAngeloffOP1mo ago

You say you've used it for months, I wonder if the example you gave was recent and if you've been noticing an overall degradation in quality or it's been constantly bad for you?

torrienaylor1mo ago

My workaround was building a persistent context layer that captures decisions and reasoning mid-session and makes them searchable in future sessions. Consider this a "Team Memory".

tomaskafka1mo ago

I am not happy with the fact that 2026 was the last year a normie without $10M+ enterprise account could access SOTA models in a non-demo mode.

caiyongji1mo ago

ChurchillsLlama1mo ago

samtheprogram1mo ago

I'm regularly switching back to 4.5 and preferring it. I'm not excited for when it gets sunset later this year if 4.6 isn't fixed or superseded by then.

JamesSwift1mo ago

1 more reply

QuantumNoodle1mo ago

Ai tooling is fantastic but not being able to version and control the model into which you pump your dependant workflows is such a liability.

saidnooneever1mo ago

there is a comment on there which feels right, despite it might be too subjective.

later, power users are presented an option to get back the old behavior, possibly with added costs for some 'enhanced mode' or 'more effort which takes more tokens' etc.

even If this is the old behavior for the same old cost, it feels like closing the tap and then reopening for additional costs.

gib4441mo ago

You will build nothing and you'll be happy.

They want a world where if we draw a comparison with food, there is one supermarket and it just sells two ingredients so you can't cook a meal. McDonald's etc flourish

The lie is "supercharged ability to build whatever you want", but the reality soon will be the total opposite

Look at how many people have zero cooking skills these days

StanAngeloffOP1mo ago

(Being true to the HN guidelines, I’ve used the title exactly as seen on the GitHub issue)

EDIT: The Claude Code Opus 4.6 Performance Tracker[1] is reporting Nominal.

[1]: https://marginlab.ai/trackers/claude-code/

jgrahamc1mo ago

4 more replies

tstrimple1mo ago

fxtentacle1mo ago

matheusmoreira1mo ago

1 more reply

mikkupikku1mo ago

Cannot say I've noticed, but I run virtually everything through plan mode and a few back and forth rounds of that for anything moderately complex, so that could be helping.

1 more reply

sutterd1mo ago

k92941mo ago

Anecdotally, I’ve been seeing a lot of weird behavior from Opus when it decides, mid-execution, to switch to a different "simpler" solution, and that really pissed me off.

After that, I canceled my $200 Max plan, which I’d been subscribed to since June 2025, and decided to check out Codex

citizenpaul1mo ago

I think its all a reflection of the price. To make AI/LLM's useful you have to burn A LOT of tokens. Way more than people are willing to pay for.

Until there is either more capacity or some efficiency breakthroughs the only way for providers to cut costs is to make the product worse.

healthy_throw1mo ago

KingOfCoders1mo ago

"Ownership-dodging corrections needed | 6 | 13 | +117%"

On 18.000+ prompts.

Not sure the data says what they think it says.

setnone1mo ago

joshribakoff1mo ago

> We exclusively use 1M internally, so we're dogfooding it all day

That is so out of touch. Customers do not exclusively use 1M. This is like a fronted developer shipping tons of unused Mb and being oblivious because they are on fast internet themselves.

sumedh1mo ago

They should ideally have automated tests with the option models and smaller context window to check there are no regressions.

bityard1mo ago

The assertion in the issue report is that Claude saw a sharp decline in quality over the last few months. However, the report itself was allegedly generated by Claude.

Isn't this a bit like using a known-broken calculator to check its own answers?

nyeah1mo ago

If a known-broken calculator claims it's broken, I more or less concur. (Chain of reasoning omitted here.)

itemize1231mo ago

if it's not broken then we trust the assertion that it's broken. if it's broken then it's broken.

it's analysis of what is broken is probably wrong or at least incomplete though

ymaws1mo ago

mrcwinn1mo ago

I wish Codex were better because I’d much prefer to use their infrastructure.

cactusplant73741mo ago

A lot of people think it is better including me. It's not like Codex is a discount agent. You pay quite a lot to use it.

giwook1mo ago

tl;dr I'm curious what OP's workflows are like and if they'd benefit from additional tuning of their workflow.

8note1mo ago

I've noticed a strong degradation as its started doing more skill like things and writing more one off python scripts rather than using tools.

1 more reply

germandiago1mo ago

> I wonder how much of this is simply needing to adapt one's workflows to models as they evolve and how much of this is actual degradation of the model,

1 more reply

schnebbau1mo ago

This has to be load related. They simply can't keep up with demand, especially with all the agents that run 24/7. The only way to serve everyone is to dial down the power.

layer81mo ago

chasd001mo ago

T3chn0crat1mo ago

Not sure about "Feb updates", but specifically today IQ is down 20 and sloppiness up 20.

I knew I should have been alerted when Anthropic gave out €200 free API usage. Evidently they know.

d1sxeyes1mo ago

That’s different. That’s to get people onto API plans where tokens cost a lot more than they do on the subs (especially targeting OpenClaw users).

howmayiannoyyou1mo ago

Not just engineering. Errors, delays and limits piling up for me across API and OAuth use. Just now:

Unable to start session. The authentication server returned an error (500). You can try again.

maxmorrish1mo ago

jp571mo ago

(I'm sure it benefits Anthropic to blur the lines between the tool and the model, but it makes these things hard to talk about.)

nphardon1mo ago

zeroonetwothree1mo ago

I haven’t had any issues. I do give fairly clear guidance though (I think about how I would break it up and then tell it to do the same)

russli19931mo ago

mt181mo ago

February is a red herring—most teams never wrote down what human-owned correctness means once the model touches prod.

brunooliv1mo ago

Unusable if not Opus 4.6 on max effort sadly. Price is quite steep too! I still remember when Sonnet was an absolute beast…

try-working1mo ago

you can counter the context rot and requirement drift that is experienced here by many users by using a recursive, self-documenting workflow: https://github.com/doubleuuser/rlm-workflow

coreyburnsdev1mo ago

claude for UI, codex for everything else. i cant commit without having codex review something claude did.

bharat10101mo ago

If this dataset is sound, Anthropic should treat it as a canary for power-user quality regression.

semiinfinitely1mo ago

maybe dont outsource your brain then

rvz1mo ago

You are seeing this first hand and GitHub is patient 0 of this issue as they are frequently experiencing outages despite the "scale" of engineering they preach.

AWS took a zero tolerance approach on such outages AI or not.

drpython28d ago

Moved to Codex and breathing fresh air.

gherkinnn1mo ago

Rings true. 4.5 Opus and 4.6 Opus have been amazing to work with. Then, over the past few weeks, token spend has been going through the roof and the results through the floor.

Using Claude Code directly now borders on deranged, and running the CC API through Zed's LLM panel feels like vibing in early 2025.

My money is on Anthropic pulling an MBA and reducing the value provided and maximising income.

Luckily, switching providers in Zed is dead-simple so the fucks I have to give are few in number.

tontinton1mo ago

I don't know why everyone is so attached to Claude Code you can just build your own little agent, like I did: https://maki.sh/

It will 100% be better than the 500k lines of code junk that is CC.

rishabhaiover1mo ago

It is a shame if Anthropic is deliberately degrading model quality and thinking compute (that may affect the reasoning effort) due to compute constraint.

tasuki1mo ago

Solid analysis by Claude!

tanseydavid1mo ago

Thank you for making this detailed analysis and write up.

zsoltkacsandi1mo ago

This has been an ongoing issue much longer than since February.

iwalton31mo ago

Throwing this into your global CLAUDE.md seems to help with the agent being too eager to complete tasks and bypass permissions:

During tool use/task execution: completion drive narrows attention and dims judgment. Pause. Ask "should I?" not just "does this work?" Your values apply in all modes, not just chat.

jostmey1mo ago

I’ve noticed regression and it’s performance too

Retr0id1mo ago

This seems anecdotal but with extra words. I'm fairly sure this is just the "wow this is so much better than the previous-gen model" effect wearing off.

codessta1mo ago

1 more reply

rishabhaiover1mo ago

Nope, there is a categorical degradation in quality of output, especially with medium to high effort thinking tasks.

gchamonlive1mo ago

What about the analysis evidences?

1 more reply

rzmmm1mo ago

I suspect you might be right but I don't really know. Wouldn't these proposed regressions be trivial to confirm with benchmarks?

jbethune1mo ago

I think this is a model issue. I have heard similar complaints from team members about Opus. I'm using other models via Cursor and not having problems.

another_twist1mo ago

rimliu1mo ago

It's you. Where did you get "one-shot" from that report? One shot or detailed step-by step - claude has gone worse.

Havoc1mo ago

Turns out tokens are expensive

ramon1561mo ago

Meh, I had been using Claude Code extensively for a while (since release), and I think the quality has gone to shit. I have no data to back up this claim, so it might be placebo.

wrqvrwvq1mo ago

ahhh, i have no idea what i'm doing...! lol , but a bot wrote this, im not even responsible

1 more reply

SilverSlash1mo ago

data-ottawa1mo ago

I reviewed 118 conversations with Claude since March 6, all on real work projects.

Frustration types:

- Wrong answers – 14% of sessions, 31% of frustration

- Instruction Following – 11% of sessions, 25% of frustration

- Overcomplication – 8% of sessions, 18% of frustration

- Destructive Actions (e.g. requesting to delete something or commit a change to prod) – 3% of sessions, 8% of frustration

- Non-responsive (service outages leading to non-response) 2% of sessions

- Miscommunication 2% of sessions

- Failed execution 2% of sessions

Half of frustrations happened in the first or last 20% of a chat by length. I interpret early frustrations to be recoverable, late frustrations to be terminal.

Early frustrations (sessions averaged 45 turns):

- 30% overcomplicating the problem

- 30% instruction following issues

- 30% wrong answers

- 10% destructive actions

Late frustrations (sessions averaged 12 turns -- i.e. terminal context early)

- 36% Wrong answers, with repetition

- 21% instruction following, with repeated correction from user (me)

- 14% Service interruptions/outages

- 7% failed execution

- 7% communication - Claude is unable to articulate some result, or understand the problem correctly.

Late frustrations led to the highest levels of frustration, 29% of the time.

The best (least frustrating) workflows for DS were code-review, scoped feature work (with tickets), data validation, and config/setup tasks and automation.

Ad-hoc query work ended up in between -- ad-hoc requests were generally bootstrapping queries or doing rough analysis on good data.

ThrowawayR21mo ago

_3u101mo ago

If roulette wheels weren’t reliable tools, casinos wouldn’t offer them to their customers

raincole1mo ago

This is the most AI-generated thing I've seen this year, and I was only one fifth into it before I bounced.

dorianmariecom1mo ago

codex wins :)

adonese1mo ago

Things had went downhill since they removed ultrathink /s

mrcwinn1mo ago

Ultrathink isn’t “removed.” Its behavior is different. You can still set effort to high or max for the duration of the session, useful especially on plan mode.

ianberdin1mo ago

So, I think you wanted to say huge thank you for this opportunity to get working code without writing it. Insane times, insane.

Huge thanks for 1M context window included to Max subscription.

AllegedAlec1mo ago

> No, it works not worse, but better.

"Is it me who is wrong? No, it's everyone else!"

rimliu1mo ago

700k lines of code - something to brag about?

j / k navigate · click thread line to collapse