> Terminal-Bench 2.0: 82.0% / 65.4% / 75.1% / 68.5%
> GPQA Diamond: 94.5% / 91.3% / 92.8% / 94.3%
> MMMLU: 92.7% / 91.1% / — / 92.6–93.6%
> USAMO: 97.6% / 42.3% / 95.2% / 74.4%
> OSWorld: 79.6% / 72.7% / 75.0% / —
On a number of these benchmarks it seems barely competitive with the previous-gen Opus 4.6 or GPT-5.4, so I don't know what to make of the significant jumps on other benchmarks within the same categories. Training to the test? Better training?
And the decision to withhold general release (of a 'preview', no less!) seems, well, odd. And the decision to release a 'preview' version to specific companies? Do you know any production teams at these massive companies that would work with a 'preview' anything? R&D teams, sure, but production? Part of me wants to LOL.
What are they trying to do? Induce FOMO and stop subscriber bleed-out stemming from the recent negative headlines around problems with using Claude?