Claude Opus 4.6 accuracy on BridgeBench hallucination test drops from 83% to 68% (opens in new tab)

(twitter.com)

70 pointsbratao2mo ago20 comments

20 comments

12 comments · 3 top-level

Reubend2mo ago· 9 in thread

Because the website doesn't seem to show any sample size of runs, I assume they ran it once across the suite.

The models are nondeterministic, and therefore it's pretty normal for different runs to give different results.

I don't see this as evidence that Opus 4.6 has gotten worse.

bsder2mo ago

> The models are nondeterministic, and therefore it's pretty normal for different runs to give different results.

And how is that an excuse?

I don't care about how good a model could be. I care about how good a model was on my run.

Consequently, my opinion on a model is going to be based around its worst performance, not its best.

As such, this qualifies as strong evidence that Opus 4.6 has gotten worse.

senko2mo ago

>> The models are nondeterministic, and therefore it's pretty normal for different runs to give different results.

> And how is that an excuse? […] this qualifies as strong evidence…

This qualifies as nothing due to how random processes work, that’s what the gp is saying. The numbers are not reliable if it’s just one run.

If this is counter-intuitive, a refresher on basic statistics and probability theory may be in order.

1 more reply

jmalicki2mo ago

No, what they're saying is the previous run could have just been lucky and not representative!

slurpyb2mo ago

I would love to know what you’re doing in the harness to not feel the total degradation in experience now in comparison to December & January.

coldtea2mo ago

>I don't see this as evidence that Opus 4.6 has gotten worse.

I see it as corroboration evidence of actual everyday experience.

Also, any reason to imply "BridgeBench", apparently dedicated to AI benchmarking, wouldn't have run it more than once across the suite?

Reubend2mo ago

> Also, any reason to imply "BridgeBench", apparently dedicated to AI benchmarking, wouldn't have run it more than once across the suite?

They didn't list a sample size of runs, didn't show any numbers for variance across runs, etc...

So while they may have done that behind the scenes and just not told us, this doesn't seem like a rigorous analysis to me. It seems to me like people just want to find data that support the conclusion they already decided on (which is that Opus got worse).

dlahoda2mo ago

are models really non deterministic?

Rury2mo ago

People are describing the results when they say models are non-deterministic. Give it the same exact input twice, and you'll get two different outputs. Deterministic would mean the same input always gives the same output.

loneboat2mo ago

Yes. Look up LLM "temperature" - it's an internal parameter that tweaks how deterministic they behave.

1 more reply

spacebacon2mo ago

Computational semiotics has been empirically proven. Model releasing soon. In the mean time, for the love of god someone recognize this and help blow these numbers out of the water.

https://open.substack.com/pub/sublius/p/the-semiotic-reflexi...

ehtbanton2mo ago

Benchmarks like this one are designed to thoroughly test the model across several iterations. 15% is a MASSIVE discrepancy.

Come on Anthropic, admit what you're doing already and let us access your best models unhindered, even if it costs us more. At the moment we just all feel short-changed.

j / k navigate · click thread line to collapse

20 comments

12 comments · 3 top-level

Reubend2mo ago· 9 in thread

Because the website doesn't seem to show any sample size of runs, I assume they ran it once across the suite.

The models are nondeterministic, and therefore it's pretty normal for different runs to give different results.

I don't see this as evidence that Opus 4.6 has gotten worse.

bsder2mo ago

> The models are nondeterministic, and therefore it's pretty normal for different runs to give different results.

And how is that an excuse?

I don't care about how good a model could be. I care about how good a model was on my run.

Consequently, my opinion on a model is going to be based around its worst performance, not its best.

As such, this qualifies as strong evidence that Opus 4.6 has gotten worse.

senko2mo ago

>> The models are nondeterministic, and therefore it's pretty normal for different runs to give different results.

> And how is that an excuse? […] this qualifies as strong evidence…

This qualifies as nothing due to how random processes work, that’s what the gp is saying. The numbers are not reliable if it’s just one run.

If this is counter-intuitive, a refresher on basic statistics and probability theory may be in order.

1 more reply

jmalicki2mo ago

No, what they're saying is the previous run could have just been lucky and not representative!

slurpyb2mo ago

I would love to know what you’re doing in the harness to not feel the total degradation in experience now in comparison to December & January.

coldtea2mo ago

>I don't see this as evidence that Opus 4.6 has gotten worse.

I see it as corroboration evidence of actual everyday experience.

Also, any reason to imply "BridgeBench", apparently dedicated to AI benchmarking, wouldn't have run it more than once across the suite?

Reubend2mo ago

> Also, any reason to imply "BridgeBench", apparently dedicated to AI benchmarking, wouldn't have run it more than once across the suite?

They didn't list a sample size of runs, didn't show any numbers for variance across runs, etc...

dlahoda2mo ago

are models really non deterministic?

Rury2mo ago

loneboat2mo ago

Yes. Look up LLM "temperature" - it's an internal parameter that tweaks how deterministic they behave.

1 more reply

spacebacon2mo ago

Computational semiotics has been empirically proven. Model releasing soon. In the mean time, for the love of god someone recognize this and help blow these numbers out of the water.

https://open.substack.com/pub/sublius/p/the-semiotic-reflexi...

ehtbanton2mo ago

Benchmarks like this one are designed to thoroughly test the model across several iterations. 15% is a MASSIVE discrepancy.

Come on Anthropic, admit what you're doing already and let us access your best models unhindered, even if it costs us more. At the moment we just all feel short-changed.

j / k navigate · click thread line to collapse