0 points | withinboredom | 1mo ago | 0 comments

Opus 4.5 ~= Opus 4.6 high. Opus 4.5 was nerfed just before or after the release of 4.6.

hhh 1mo ago

The models don’t change.

tornikeo 1mo ago

On paper. There's a huge financial incentive to quantize the crap out of a good model to save cash after you've hooked in subscriptions.

armchairhacker 1mo ago

And there’s an incentive to publish evidence of this to discourage it. Do you have any?

TeMPOraL 1mo ago

Models aren't just the big bags of floats you imagine them to be. Those bags are there, but there's a whole layer of runtimes, caches, timers, load balancers, classifiers/sanitizers, etc. around them, all of which have tunable parameters that affect the user-perceptible output.

woadwarrior01 1mo ago

There's this[1]. Model providers have a strong incentive to switch (a part of) their inference fleet to quantized models during peak loads. From a systems perspective, it's just another lever. Better to have slightly nerfed models than complete downtime.

[1]: https://marginlab.ai/trackers/claude-code/

coldtea 1mo ago

Anybody with more than five years in the tech industry has seen this done in all domains time and again. What evidence do you have that AI is different? That is the extraordinary claim in this case...

seunosewa 1mo ago

Or just change the reasoning levels.

fer 1mo ago

They do. I'm currently seeing a degradation in Opus 4.6 on tasks it could do without trouble a few months back. Obviously I'm a sample of n=1, but I'm also convinced a new model is around the corner and they preemptively nerf their current model so people notice the "improvement".

stavros 1mo ago

Make that 2. I told my friends yesterday "Opus got dumb, new model must be coming".

arcanemachiner 1mo ago

I swear that different sessions will route to different quants. Sometimes it's good, sometimes not.

esskay 1mo ago

Real world usage suggests otherwise. It's been a known trend for a while. Anthropic even confirmed as much ~6 months ago but said it was a "bug", one that somehow just keeps happening 4-6 months after a model is released.

yorwba 1mo ago

Real world usage is unlikely to give you the large sample sizes needed to reliably detect the differences between models. Standard error scales as the inverse square root of sample size, so even a difference as large as 10 percentage points would require hundreds of samples.

https://marginlab.ai/trackers/claude-code/ tries to track Claude Opus performance on SWE-Bench-Pro, but since they only sample 50 tasks per day, the confidence intervals are very wide. (This was submitted 2 months ago https://news.ycombinator.com/item?id=46810282 when they "detected" a statistically significant deviation, but that was because they used the first day's measurement as the baseline, so at some point they had enough samples to notice that this was significantly different from the long-term average. It seems like they have fixed this error by now.)
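
A rough sketch of the sample-size argument above, for anyone who wants to plug in their own numbers. It assumes independent pass/fail tasks, a normal approximation, and a two-sided two-proportion test at 5% significance and 80% power. Those thresholds, the function names, and the use of the 56% baseline quoted elsewhere in this thread are illustrative assumptions, not anything the tracker is known to do.

```python
import math
from statistics import NormalDist


def standard_error(p: float, n: int) -> float:
    """Standard error of a pass rate p measured on n independent tasks."""
    return math.sqrt(p * (1.0 - p) / n)


def ci_half_width(p: float, n: int, confidence: float = 0.95) -> float:
    """Half-width of the normal-approximation confidence interval."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return z * standard_error(p, n)


def samples_per_group(p1: float, p2: float,
                      alpha: float = 0.05, power: float = 0.80) -> int:
    """Tasks needed per model to detect p1 vs p2 (assumed test setup)."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


if __name__ == "__main__":
    # 50 tasks per day at an assumed 56% pass rate.
    print(f"daily 95% CI half-width: {ci_half_width(0.56, 50):.3f}")
    # Detecting a 10 percentage point drop from that assumed baseline.
    print(f"tasks per group: {samples_per_group(0.56, 0.46)}")
```

Under these assumptions a single day of 50 tasks pins the pass rate down to only about ±14 percentage points, and a 10-point drop needs a few hundred tasks per model to detect reliably. Because the standard error scales as 1/sqrt(n), quadrupling the sample size only halves the interval.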

nextaccountic 1mo ago

It's hard to trust public, high-profile benchmarks, because any change to a specific model (Opus 4.5 in this case) can be rejected if it regresses on SWE-Bench-Pro, so everything that gets released would perform well on this benchmark.

scrollop 1mo ago

You sure about that?

https://marginlab.ai/trackers/claude-code/

withinboredom (OP) 1mo ago

Well, I don't see 4.5 on there ... so I'm not sure what you're trying to say.

And today is a 53% pass rate vs. a baseline 56% pass rate. That's a huge difference. If we recall what Anthropic originally promised a "max 5" user https://github.com/anthropics/claude-code/issues/16157#issue... -- which they've since removed from their site...

50-200 prompts. That's an extra 1-6 "wrong solutions" per 5 hours ... and you have to get a lot of wrong answers to arrive at a wrong solution.
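
The arithmetic behind that range, as a quick sketch. It assumes every prompt behaves like an independent benchmark task with the quoted pass rates, which is a simplification; the function name is made up for illustration.

```python
def extra_failures(baseline_rate: float, current_rate: float, prompts: int) -> float:
    """Expected additional failed tasks when the pass rate drops."""
    return (baseline_rate - current_rate) * prompts


if __name__ == "__main__":
    # 56% baseline vs. 53% today, over 50 to 200 prompts per 5 hours.
    for prompts in (50, 200):
        extra = extra_failures(0.56, 0.53, prompts)
        print(f"{prompts} prompts: about {extra:.1f} extra failures")
```

A 3 percentage point drop works out to roughly 1.5 extra failures at 50 prompts and 6 at 200.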

coldtea 1mo ago

Only nominally...

pixel_popping 1mo ago

Oh yes, they do.

girvo 1mo ago

I think the conspiracy theories are silly, but equally I think pretending these black boxes are completely stable once they're released is incorrect as well.

coldtea 1mo ago

No conspiracy theories. Just companies being scumbags, cutting corners, and doctoring benchmarks while denying it. It has been happening forever.
