Is Opus 4.7 a Downgrade?

(vincentschmalbach.com)

8 points · vincent_s · 1mo ago · 6 comments

6 comments · 3 top-level

bisonbear · 1mo ago · 2 in thread

I'm currently benchmarking the Opus 4.7 reasoning-effort curve against real-world tasks, and have found that reasoning effort does not seem to monotonically improve results (at least on the slice I'm looking at). I've been puzzling over this, but perhaps the fact that Claude Code has adaptive thinking explains some of it: even at medium reasoning effort, it can use more thinking tokens when needed to solve a complex problem.

Snapshot of the results (sorry for the busted format, ask your LLM for dataviz; I can't seem to format a good table in the comments):

Opus 4.7 on GraphQL-go-tools:

Low: 23/29 pass, 10/29 equivalent, 5/29 review-pass, custom avg 2.598, $2.50/task, 384s/task

Medium: 28/29 pass, 14/29 equivalent, 10/29 review-pass, custom avg 2.759, $3.15/task, 451s/task

High: 26/29 pass, 12/29 equivalent, 7/29 review-pass, custom avg 2.670, $5.01/task, 716s/task

Xhigh: 25/29 pass, 11/29 equivalent, 4/29 review-pass, custom avg 2.669, $6.51/task, 804s/task

Max: 27/29 pass, 13/29 equivalent, 8/29 review-pass, custom avg 2.690, $8.84/task, 997s/task

(custom avg is the average score over a set of rubrics used for LLM-as-a-judge, graded out of 4)

Practically, the results indicate that medium has better outcomes, or at least the same outcomes, considering variance, as higher reasoning efforts, at a much lower cost/time.
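One way to sanity-check the "considering variance" caveat is to put a rough confidence interval around each pass rate. The sketch below uses the Wilson score interval on the pass counts posted above; the interval method and the 95% level are my own choices, not something the benchmark itself reports, and overlapping intervals are only a loose heuristic (the same 29 tasks are run at each effort level, so a paired comparison would be more appropriate).

```python
import math

# Pass counts copied from the results above (29 tasks per effort level).
PASSES = {"low": 23, "medium": 28, "high": 26, "xhigh": 25, "max": 27}
TOTAL = 29


def wilson_interval(passed, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = passed / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


intervals = {effort: wilson_interval(k, TOTAL) for effort, k in PASSES.items()}

for effort, (lower, upper) in intervals.items():
    print(f"{effort:>6}: {PASSES[effort]}/{TOTAL} pass, "
          f"~95% interval {lower:.2f} to {upper:.2f}")
```

With only 29 tasks the intervals are wide, and even the low and medium intervals overlap, which fits the reading that medium is at least as good as the higher efforts rather than provably better.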

vincent_s (OP) · 1mo ago

The effort parameter in Claude Code is essentially useless. It's just a way of expressing that you'd like deeper reasoning, but Anthropic can and does ignore it without even telling you.

bisonbear · 1mo ago

Claude does appear to work for longer, and use more tokens, at higher reasoning efforts. It just doesn't seem like this increased token usage leads to better actual outcomes.
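To make that concrete, here is a small sketch that compares each effort level against medium using the figures from the snapshot above. Token counts weren't posted, so cost and wall-clock time per task stand in for them here; the names (`RESULTS`, `relative_to_baseline`) are just illustrative.

```python
# (passed out of 29, cost in USD per task, seconds per task), copied from the snapshot.
RESULTS = {
    "low": (23, 2.50, 384),
    "medium": (28, 3.15, 451),
    "high": (26, 5.01, 716),
    "xhigh": (25, 6.51, 804),
    "max": (27, 8.84, 997),
}


def relative_to_baseline(results, baseline="medium"):
    """Express each effort level's passes, cost, and time relative to the baseline."""
    base_passed, base_cost, base_seconds = results[baseline]
    return {
        effort: {
            "extra_passes": passed - base_passed,
            "cost_ratio": cost / base_cost,
            "time_ratio": seconds / base_seconds,
        }
        for effort, (passed, cost, seconds) in results.items()
    }


relative = relative_to_baseline(RESULTS)

for effort, row in relative.items():
    print(f"{effort:>6}: {row['extra_passes']:+d} passes vs medium, "
          f"{row['cost_ratio']:.2f}x cost, {row['time_ratio']:.2f}x time")
```

On these numbers, no effort level passes more tasks than medium, while cost and time per task keep climbing above it.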

gigatexal · 1mo ago · 1 in thread

I think it is. We have been using it at my day job and we regularly choose Sonnet 4.6 for well-scoped things. Opus 4.6 was good, but Opus 4.7 burns so many tokens and dollars that it's just not worth it given the incremental improvement in results.

vincent_s (OP) · 1mo ago

They also changed how they count tokens, so you could end up with less reasoning while paying for more tokens. Anthropic's profit margin is definitely higher on 4.7 than it was on 4.6. I'm pretty sure this was the main driver behind this update.

saltypixel · 1mo ago

I've noticed that with 4.7 I regularly have to say "just say done when done." Otherwise I get 15min loaded recaps in follow-up responses. 4.7 seems overly chatty vs lower models.
