I went back to the BixBench benchmark which they mentioned. I couldn't find official results for Anthropic models, but I found a project taking Opus 4.6 from 65.3% to 92.0% (which would be above GPT-Rosalind) with nearly 200 carefully crafted skills [1]. There also appear to be competitor models with scores on par with this tuned GPT.
It’s kind of gross to make money off her name (if that’s what’s happening) posthumously. It’s a complicated story anyway. IIRC her sister referred to it as “the Cult of Rosalind” when people were cashing in on books about her.
Sam Altman, August 2025
For me too, it was around that time last year, with GPT-5, Claude Sonnet 4.5, and then Gemini 3, that I started feeling these models are clearly becoming great at reasoning. I'm not at all opposed to saying they are around PhD-level in at least some domains.
Earlier this year I tried to do this for a much simpler target than bioscience: a Farnsworth fusor. Even though I started off with roughly "which open source physics libraries do you recommend we use for this?" and it gave me a list, instead of actually bothering to use any of the libraries it suggested, it decided to roll its own simulation code, and the code it wrote very obviously didn't work.
It may *assist* with coding, but I don't think it could code for them yet.
[1] https://github.com/openai/plugins/tree/main/plugins/life-sci...
Why? AI's reputation would be greatly improved by saving a few tens of millions of lives (per year, I might add). And either of those advances would do just that.
Oh, and another reason: do either of these things and you'll have very rich businesses coming out of every hole, screaming to become your customer. Guaranteed.
I'm absolutely OK with a legitimate lab scientist conducting biochemical research getting suggestions about substances that are generally considered dangerous but might be appropriate for their study; it's up to the scientist to discern whether a substance is indeed appropriate to use.
At the moment, it feels like releases like this overpromise on "PhD-level reasoning", which I wouldn't say is the actual bottleneck in clinical research.