Claude Opus 4.7 Model Card

(anthropic.com)

177 points · adocomplete · 2mo ago · 84 comments

60 comments · 18 top-level

bachittle · 2mo ago · 10 in thread

So Opus 4.7 is measurably worse at long-context retrieval compared to Opus 4.6. Opus 4.6 scores 91.9% and Opus 4.7 scores 59.2%. At least they're transparent about the model degradation. They traded long-context retrieval for better software engineering and math scores.

film42 · 2mo ago

To be honest, I think it's just a more honest score of what Opus 4.6 actually was. Once contexts get sufficiently large, Opus develops pretty bad short term memory loss.

tomaskafka · 2mo ago

You can support very long context windows if you don’t mind an abysmal recall rate.

enraged_camel · 2mo ago

No: https://x.com/bcherny/status/2044821690920980626

freedomben · 2mo ago

Agreed, I appreciate the transparency (and Anthropic isn't normally very transparent). It's also great to know because I will change how I approach long contexts knowing it struggles more with them.

RobinL · 2mo ago

Could this be because they've found the 1M context uneconomical (i.e. it costs too much to serve, or burns through users' quota too quickly, causing complaints), and so they're no longer targeting it as a goal?

teaearlgraycold · 2mo ago

A year ago it felt like SoTA model developers were not improving so much as moving the dirt around. Maybe we’re in another such rut.

msla · 2mo ago

Also, just to be clear: This links to a PDF, for some reason.

jzig · 2mo ago

At what point along the 1M window does context become "long" enough that this degradation occurs?

daemonologist · 2mo ago

The benchmark GP mentioned is measuring at 128k-256k context (there's another at 524k-1024k, where 4.6 scored 78.3% and 4.7 scored 32.2%).

The longer the context, the worse the performance; there isn't really a qualitative step change in capability (if there is, imo it happens at around 8k-16k tokens, much sooner than is relevant for multi-turn coding tasks; see e.g. this old benchmark: https://github.com/adobe-research/NoLiMa ).
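
To put the scores quoted in this thread side by side, here is a small sketch that computes the absolute and relative drop for each context range. The numbers are the ones cited above; the dictionary layout and the `drop` helper are only illustrative and do not come from the model card.

```python
# Long-context retrieval scores as quoted in this thread (percent).
# The structure and names are illustrative, not taken from the model card.
scores = {
    "128k-256k": {"opus_4_6": 91.9, "opus_4_7": 59.2},
    "524k-1024k": {"opus_4_6": 78.3, "opus_4_7": 32.2},
}


def drop(old: float, new: float) -> tuple[float, float]:
    """Return (percentage-point drop, drop as a fraction of the old score)."""
    points = old - new
    return points, points / old


for context_range, s in scores.items():
    points, relative = drop(s["opus_4_6"], s["opus_4_7"])
    print(f"{context_range}: -{points:.1f} points ({relative:.0%} relative)")
```

On these figures the drop is about 33 points (roughly a third of the old score) at 128k-256k and about 46 points (more than half) at 524k-1024k.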

the13 · 2mo ago

Be brief. No one wants AI boyfriend users who drone on & on about their day.

STRiDEX · 2mo ago · 8 in thread

Dumb question, but why are chemical weapons always addressed as a risk with LLMs? Is the idea that they contain instructions for making chemical weapons, or that they would guide someone through it?

Would there not already be websites that contain that information? How is an LLM different, I guess, from some sort of Anarchist Cookbook thing?

Philpax · 2mo ago

Both. There's the risk of them instructing a user on how to produce a known formulation (the Anarchist Cookbook solution, as you say), which is irritating but not that problematic.

The bigger issue is that they are potentially capable of producing novel formulations that can cause harm, and of guiding someone through the process. That is, consider a world in which someone with malicious desires has access to a model as capable at chemistry / biology as Mythos is at offensive cybersecurity.

This is obviously limited by the fact that the models don't operate in the physical world, but there's plenty of written material out there.

rogerrogerr · 2mo ago

The world has been blessed by two connected things:

1. Smart people have economic opportunities that align them away from being evil

2. People who are evil tend not to be smart.

We're breaking both of these assumptions.

dcre · 2mo ago

LLMs can tell you exactly how to acquire and manufacture the materials. They might even come up with novel formulations that rely on substances that are easier to get. There might be information about this stuff online, but LLMs are much better than random idiots at adapting that information to their actual situation.

On top of LLMs reducing the cost/difficulty, the other reason biological and chemical weapons are such a worry is their asymmetric character — they are much much easier and cheaper to produce and deploy than they are to defend against.

Aboutplants · 2mo ago

It’s marketing. Fear is one of the most effective marketing tools. That, and it serves the purpose of getting government attention.

somesortofthing · 2mo ago

They contain broad overviews (throw some disease-causing bacteria in a sort of rainbow arrangement of increasingly more effective antibiotics, you'll usually get something that's at least very deadly even if it doesn't have pandemic potential) but executing in a real lab takes a ton of trial and error to figure out the details. The issue is that the details ~all exist somewhere in the training dataset already, discovered and documented over the course of unrelated, benign biology research. Ability to quickly and accurately search over that corpus translates to large speedups in the physical development process.

Nicook · 2mo ago

Probably also a bit of liability. After all, it's been trained on a dataset that includes a long-running joke of trying to trick people on the internet into unknowingly creating chlorine gas.

CodingJeebus · 2mo ago

WAG but I wonder if a hijacked LLM could also assist with figuring out how to obtain required materials, not just provide the recipe.

rgbrenner · 2mo ago

In the same way that all coding docs are available publicly.

jmward01 · 2mo ago · 7 in thread

Haiku not getting an update is becoming telling. I suspect we are reaching a point where the low end models are cannibalizing high end and that isn't going to stop. How will these companies make money in a few years when even the smallest models are amazing?

blixt · 2mo ago

Isn't it pretty common for the smaller models to release a little while after the bigger ones, for all the big model providers?

jmward01 · 2mo ago

The last update for Haiku was in October, or in startup land, 10 years ago.

mvkel · 2mo ago

It seems to be a rule that older models are more expensive than newer ones. The low end models have higher $CPT and worse output. I wonder if the move is to just have one model and quantize if you hit compute constraints.

deaux · 2mo ago

> It seems to be a rule that older models are more expensive than newer ones.

It isn't. Gemini has gotten more expensive with each release. Anthropic has stayed pretty similar over time, no? When is the last time OpenAI dropped API prices? OpenAI started very high because they were the first, so there was a ton of low hanging fruit and there was much room to drop.

qingcharles · 2mo ago

Google is putting a lot of research into small models. Most of my AI budget is now going to small models because I am doing lots of tiny tasks that the small models do great with. I would think a decent chunk of Goog's API revenue probably comes from their small models.

dkhenry · 2mo ago

The Gemma models are at this point. A 31B model that can fit on a consumer card is as good as Sonnet 4.5. I haven't put it through as much on the coding front or tool calling as I have the Claude or GPT models, but for text processing it is on par with the frontier models.

make3 · 2mo ago

absolutely not on par you're smoking

aliljet · 2mo ago · 4 in thread

Have they effectively communicated what a 20x or 10x Claude subscription actually means? And with Claude 4.7 increasing usage by 1.35x, does that mean a 20x plan is now really a 13x plan (no token increase on the subscription) or a 27x plan (more tokens given to compensate for more compute cost) relative to Claude Opus 4.6?
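
The arithmetic behind this question can be sketched as follows, assuming that "1.35x" means the same task consumes 1.35 times as much quota on Opus 4.7 as on Opus 4.6. That reading is an assumption; the thread does not pin it down. Both function names are hypothetical.

```python
PLAN = 20            # nominal "20x" plan
USAGE_FACTOR = 1.35  # assumed: quota used per task on Opus 4.7 relative to Opus 4.6


def work_without_compensation(plan: float, usage_factor: float) -> float:
    """Plan size in Opus 4.6-equivalent work if the token allowance is unchanged."""
    return plan / usage_factor


def allowance_needed_to_compensate(plan: float, usage_factor: float) -> float:
    """Token allowance (in old units) needed to keep the same amount of work."""
    return plan * usage_factor


print(f"unchanged allowance: {work_without_compensation(PLAN, USAGE_FACTOR):.1f}x")
print(f"compensated allowance: {allowance_needed_to_compensate(PLAN, USAGE_FACTOR):.1f}x")
```

Under that assumption, an uncompensated 20x plan works out to roughly 14.8x of Opus 4.6-equivalent work. The 13x figure corresponds to subtracting 35% (20 × 0.65) rather than dividing by 1.35; the 27x figure matches 20 × 1.35.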

computomatic · 2mo ago

They have communicated it as 5x is 5 x Pro, and 20x is 20 x Pro (I haven’t looked lately so not sure if that’s changed).

They have also repeatedly communicated that the base unit (Pro allotment) is subject to change and does change often.

As far as I can tell, that implies there is no guarantee that those subscriptions get some specific number of tokens per unit of time. It’s not a claim they make.

msikora · 2mo ago

I think, as far as the (maybe more important) weekly allotment goes, Max 5 is 10x Pro and Max 20 is 20x Pro. For the 5-hour window, it is as the names would suggest, though.

DonsDiscountGas · 2mo ago

Definitely 13x, at least for now

ModernMech · 2mo ago

Feels like buying toilet paper.

vessenes · 2mo ago · 3 in thread

This is an interesting document, in that it reads like a Claude Mythos model card that was hastily edited to be an Opus 4.7 model card.

I surmise that someone at the top put the Mythos release on hold, and the product team was told "ship this other interim step model instead. quickly."

I wonder if 4.7 will be seen as a net step-up in quality; there are some regressions noted in the document, and it's clearly substantially worse than Mythos, at least according to its own model card. Should be an interesting few months -- if I were at OpenAI I'd be rushing to get something out that's clearly better, and pressing on the weaknesses here.

the13 · 2mo ago

What makes you think that? "it reads like a Claude Mythos model card that was hastily edited to be an Opus 4.7 model card"

vessenes · 2mo ago

There are more mentions of Mythos than 4.6. Mythos results are nearly everywhere, and vastly exceed 4.7's capacity in almost every case. There are sections that report only research on Mythos, none on 4.7. E.g. user surveys about how beneficial Mythos is internally at Anthropic.

barneybooroo · 2mo ago

Yeah, the section expanding on how they evaluated Mythos internally is a bit baffling considering how irrelevant it is.

koehr · 2mo ago · 3 in thread

This reads more like an advertisement for Mythos, at first glance.

Uehreka · 2mo ago

I never understand these critiques. If something is useful and you’re selling it, does that mean any technical document describing its usefulness becomes marketing?

I guess maybe, but then do those documents lose value as technical documents? Not necessarily at all, so I don’t see the point. How are you supposed to describe a useful technical thing to users?

parsimo2010 · 2mo ago

This is supposedly the Opus 4.7 model card. It's okay for it to be marketing for Opus 4.7 and describe what it can do, and even okay for it to talk about what it does better than the last generation. GP was saying it sounds like marketing for Mythos (a different and unreleased model). I don't want the Opus 4.7 model card to be advertising for something else.

For context, the word "Mythos" appears 331 times in a 221-page document. "Opus 4.6" appears 240 times, so references to a model that nobody has really used occur more often than references to the last-generation model.

ModernMech · 2mo ago

That's why I don't like these "model cards" being presented as if they are some sort of technical document -- they're marketing materials.

msla · 2mo ago · 2 in thread

PDF, because it isn't marked.

marginalia_nu · 2mo ago

It's not 1998 any more. All browsers read PDFs now.

msla · 2mo ago

Do you think your comment adds anything?

bicepjai · 2mo ago · 2 in thread

This card is a 272-page report. So now we are redefining names :)

albert_e · 2mo ago

Does the model card fit in the model's context :)

anonyfox · 2mo ago

well it will saturate your 5h limit window at least

deflator · 2mo ago · 2 in thread

Model Welfare? Are they serious about this? Or is it just more hype? I really don't trust anything this company says anymore. "We have a model that is too dangerous to release" is like me saying that I have a billion dollars in gold that nobody is allowed to see but I expect to be able to borrow against it.

hgoel · 2mo ago

Maybe referring to it as welfare is odd, but these points are important. It isn't a good look to have a model that tends to get into self-deprecating loops like one of Google's older models; it's an even worse look, and a potential legal liability, if your model becomes associated with a suicide. An overly negative chat model would also just be unpleasant to use.

With the weights being mostly opaque, these kinds of evaluations are an important piece of reducing the harm an AI model can cause.

deflator · 2mo ago

I feel that anthropomorphizing the model is also potentially very harmful. We've seen that in the LLM interactions that end in tragedy. It's the wording that bothers me.

kube-system · 2mo ago · 1 in thread

> Chemical and biological weapons threat model 2 (CB-2): Novel chemical/biological weapons production capabilities. A model has CB-2 capabilities if it has the ability to significantly help threat actors (for example, moderately resourced expert-backed teams) create/obtain and deploy chemical and/or biological weapons with potential for catastrophic damages far beyond those of past catastrophes such as COVID-19.

That's an interesting choice of benchmark for measuring the risk of "Chemical and biological weapons"

Aboutplants · 2mo ago

Gotta prime those Government fears!

Symmetry · 2mo ago

> The technical error that caused accidental chain-of-thought supervision in some prior models (including Mythos Preview) was also present during the training of Claude Opus 4.7, affecting 7.8% of episodes.

>_>

100ms · 2mo ago

    $ pbpaste | wc -w
    62508
    $ pbpaste | grep -oi mythos | wc -w
    331
    $ pbpaste | grep -oi opus | wc -w
    809
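
For anyone not on macOS (`pbpaste` reads the macOS clipboard), roughly the same counts can be reproduced in Python. This is only a sketch: `count_mentions` is a hypothetical helper and the sample text is made up; in practice the input would be the extracted text of the PDF.

```python
import re


def count_mentions(text: str, term: str) -> int:
    """Count case-insensitive, non-overlapping occurrences of `term`,
    similar in spirit to `grep -oi term | wc -l`."""
    return len(re.findall(re.escape(term), text, flags=re.IGNORECASE))


# Made-up sample text, used only to show the call.
sample = "Opus 4.7 trails Mythos. mythos Preview and opus 4.6 are also cited."
print(count_mentions(sample, "mythos"))
print(count_mentions(sample, "opus"))
```

Note that counting "opus" matches every Opus version, so that total is not directly comparable with a count for "Opus 4.6" alone.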

joeumn · 2mo ago

I'm actually surprised at how it performed compared to 4.6 and also compared to Mythos. Will be fun to use.

nullc · 2mo ago

The model card doesn't mention if this revision will continue to make up and fan vicious conspiracy theories like the prior one does.

I've been getting a small but steady stream of harassment from mentally ill people who get spun up on crazy conspiracy theories, and Claude is all too willing to tell them they are ABSOLUTELY RIGHT, encourage them to TAKE ACTION, and tell them that people who disagree are IN ON IT.

The other major LLM services will deflect to be less crazy or shut down the conversation entirely, but it seems Claude doesn't. Anthropic is probably the worst about prattling on about safety, but it seems like their concern is mostly centered on insane movie-plot threats and less on things with more potential for real harm.

I've complained to Anthropic with no response.

il-b · 2mo ago

Ironically, the website is down

NickNaraghi · 2mo ago

232 pages is bullshit. Longer than the Mythos system card? What are you hiding?

Rekindle8090 · 2mo ago

Can someone please explain the point of these incremental upgrades? Just release one model. Then maybe do a .5. Then do the next version.

What is the justification for .4.5.6.7.8.9 when the difference isn't measurable and it destroys productivity because they test the next increment on the previous one without customer consent?

gignico · 2mo ago

So LLMs are destroying the economy and the environment but at least “catastrophic risk” is still low. Ok then…
