The signal that matters for shipped products is different: what are users trying to accomplish, where do they give up mid-conversation, and what does the agent consistently fail at from the user's perspective? Task duration is a capability benchmark. Intent and drop-off analytics are product health metrics.
Most teams building AI agents right now are flying completely blind on the latter. They have LLM observability (latency, token cost, evals) but zero visibility into user behavior patterns inside their agent. Those are two very different problems with two very different buyers.
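A minimal sketch of what that drop-off signal could look like, assuming a hypothetical log of (role, text) turns per conversation; treating conversations that end on an agent turn as abandoned is a crude proxy for "gave up mid-conversation", not any particular product's metric:

    from collections import Counter

    def drop_off_point(turns: list[tuple[str, str]]) -> int | None:
        # A conversation that ends on an agent turn with no user reply is
        # treated as abandoned; return the index of that last agent turn.
        if turns and turns[-1][0] == "agent":
            return len(turns) - 1
        return None

    conversations = [
        [("user", "Cancel my order"), ("agent", "Which order?")],             # abandoned
        [("user", "Reset password"), ("agent", "Done"), ("user", "Thanks")],  # completed
    ]
    abandoned_at = Counter(
        i for c in conversations if (i := drop_off_point(c)) is not None
    )
    print(abandoned_at)  # Counter({1: 1}): one drop-off right after turn 1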
It's so strange. I feel it myself using the tools: one day is different from the next in terms of how much thinking a model is going to do.
I'm starting to wonder if a new model isn't just a tweak of the previous one: make a big deal about it, crank the thinking up, get good reviews on blogs, then tweak it back down for cost savings.
They go through these waves. Otherwise, how can you explain that they release new models _on the same day_, within hours of each other?
I think we're all being fooled by these incremental updates. Many people are reporting that the models are worse now than in December. I felt it too for many queries. I understand they're trying to balance cost against response quality, but it seems quite erratic and gamified.
Why would I want it to "think" more than it apparently needs to with 4.5?
That's just straight up nonsense, no? How much cherry picking do you need?
Looks to me like fishing for data that happens to look good.
>from under 25 minutes to over 45 minutes.
If I get my Raspberry Pi to run an LLM task, it'll run for over 6 hours. And Groq will do it in 20 seconds.
It's a gibberish measurement on its own if you don't control for token speed (and output quality).
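To make that concrete, a minimal sketch with invented numbers: the same ~40k tokens of output can take 45 minutes or 20 seconds depending purely on serving speed, so wall-clock duration alone conflates capability with throughput:

    tasks = [
        # (provider, wall_clock_seconds, completion_tokens) -- made-up numbers
        ("slow_host", 2700, 40_000),   # 45 min on slow hardware
        ("fast_host", 20, 40_000),     # same output in 20 s on fast inference
    ]
    for provider, seconds, tokens in tasks:
        print(f"{provider}: {seconds}s wall clock at "
              f"{tokens / seconds:.0f} tok/s -> {tokens} tokens of work")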
This may come as a shock, but there are LLMs not authored by Anthropic, and when we take measurements we may want them to be comparable across providers.
Claude Opus is like Slow Helpful Cloudbreaker. And not even actually slow. Just slow compared to how fast you expect machines to act.
The fact that there is no clear trend in lower percentiles makes this more suspect to me.
If you want to control for user base evolution given the growth they've seen, look at the percentiles by cohort.
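A minimal sketch of that cohort cut, assuming a hypothetical per-task table with a user's signup month, the task date, and the duration; holding the signup cohort fixed separates "tasks got longer" from "the user base changed as it grew":

    import pandas as pd

    # Hypothetical columns: user_id, signup_month, task_date, duration_minutes
    df = pd.read_csv("tasks.csv", parse_dates=["task_date"])
    df["task_month"] = df["task_date"].dt.to_period("M")

    # 90th-percentile task duration per signup cohort, month over month
    p90 = (
        df.groupby(["signup_month", "task_month"])["duration_minutes"]
          .quantile(0.90)
          .unstack("task_month")
    )
    print(p90)  # rows: cohorts; columns: calendar months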
I actually come away from this questioning the METR work on autonomy.
You can see the trend for other percentiles at the bottom of the file they link to in the blog post: https://cdn.sanity.io/files/4zrzovbb/website/5b4158dc1afb211...
I really hope this is a simulation example.
How autonomous are humans?
Do I need to continually correct them and provide guidance?
Do they go off track?
Do they waste time on something that doesn't matter?
Autonomous humans have the same problems.
The way Clio works, "private" just means removing first-person speech while leaving a summary of the data behind.
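Roughly what that implies, as a sketch based only on the description above (the `summarize` stand-in and every name here are assumptions, not Anthropic's actual pipeline): the raw text gets discarded, but a summary derived from user content survives and is what gets clustered:

    import re
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    def summarize(conversation: str) -> str:
        # Crude stand-in for an LLM summarizer: drop first-person
        # sentences, keep the rest.
        kept = [s for s in re.split(r"(?<=[.!?])\s+", conversation)
                if not re.match(r"(I|my|me)\b", s, re.IGNORECASE)]
        return " ".join(kept)[:200]

    conversations = [
        "I need help with my tax return. What deductions apply to freelancers?",
        "I want to plan a trip. Suggest a 3-day itinerary for Kyoto.",
        "My Python script crashes. How do I debug a KeyError?",
    ]
    summaries = [summarize(c) for c in conversations]
    del conversations  # the raw text is gone...

    # ...but the summaries, still derived from user content, persist
    # and get clustered into topic tags.
    X = TfidfVectorizer().fit_transform(summaries)
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)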
Even though the data is summarized, that still means your IP is stored by Anthropic? For me it's actually a huge data security issue (that I only figured out just now, sigh).
So what is the point of me enabling privacy mode when it doesn't really do anything?
There might be some risk of a data leak when a new cluster (tag) is defined. But that's not the same as saying they are viewing summaries of content.