> On our verbalized evaluation awareness metric, which we take as an indicator of potential risks to the soundness of the evaluation, we saw improvement relative to Opus 4.5. However, this result is confounded by additional internal and external analysis suggesting that Claude Opus 4.6 is often able to distinguish evaluations from real-world deployment, even when this awareness is not verbalized.
[1] https://www-cdn.anthropic.com/14e4fb01875d2a69f646fa5e574dea...