The signal that matters for shipped products is different: what are users trying to accomplish, where do they give up mid-conversation, and what does the agent consistently fail at from the user's perspective? Task duration is a capability benchmark. Intent and drop-off analytics are product health metrics.
Most teams building AI agents right now are flying completely blind on the latter. They have LLM observability (latency, token cost, evals) but zero visibility into user behavior patterns inside their agent. Those are two very different problems with two very different buyers.
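A minimal sketch of what that drop-off signal could look like, assuming a hypothetical log of (role, text) turns per conversation; treating conversations that end on an agent turn as abandoned is a crude proxy for "gave up mid-conversation", not any particular product's metric:

    from collections import Counter

    def drop_off_point(turns: list[tuple[str, str]]) -> int | None:
        # A conversation that ends on an agent turn with no user reply is
        # treated as abandoned; return the index of that last agent turn.
        if turns and turns[-1][0] == "agent":
            return len(turns) - 1
        return None

    conversations = [
        [("user", "Cancel my order"), ("agent", "Which order?")],             # abandoned
        [("user", "Reset password"), ("agent", "Done"), ("user", "Thanks")],  # completed
    ]
    abandoned_at = Counter(
        i for c in conversations if (i := drop_off_point(c)) is not None
    )
    print(abandoned_at)  # Counter({1: 1}): one drop-off right after turn 1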
It's so strange. I feel it myself using the tools: one day is different from the next in terms of how much thinking a model is going to do.
I'm starting to wonder if a new model isn't just a tweak of the previous one: make a big deal about it, crank the thinking up, get good reviews on blogs, then tweak it back down for cost savings.
They go through these waves. Otherwise, how can you explain that they release new models _on the same day_, within hours of each other?
I think we're all being fooled by these incremental updates. Many people are reporting that the models are worse now than in December. I felt it too for many queries. I understand they're trying to balance cost against response quality, but it seems quite erratic and gamified.
Why would I want it to "think" more than it apparently needs to with 4.5?
That's just straight up nonsense, no? How much cherry picking do you need?
Looks to me like fishing for data that happens to look good.
>from under 25 minutes to over 45 minutes.
If I get my Raspberry Pi to run an LLM task, it'll run for over 6 hours. And Groq will do it in 20 seconds.
It's a gibberish measurement on its own if you don't control for token speed (and output quality).
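To make that concrete, a minimal sketch with invented numbers: the same ~40k tokens of output can take 45 minutes or 20 seconds depending purely on serving speed, so wall-clock duration alone conflates capability with throughput:

    tasks = [
        # (provider, wall_clock_seconds, completion_tokens) -- made-up numbers
        ("slow_host", 2700, 40_000),   # 45 min on slow hardware
        ("fast_host", 20, 40_000),     # same output in 20 s on fast inference
    ]
    for provider, seconds, tokens in tasks:
        print(f"{provider}: {seconds}s wall clock at "
              f"{tokens / seconds:.0f} tok/s -> {tokens} tokens of work")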
This may come as a shock, but there are LLMs not authored by Anthropic, and when we take measurements we may want them to be comparable across providers.
Claude Opus is like Slow Helpful Cloudbreaker. And not even actually slow. Just slow compared to how fast you expect machines to act.
The fact that there is no clear trend in lower percentiles makes this more suspect to me.
If you want to control for user base evolution given the growth they've seen, look at the percentiles by cohort.
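A minimal sketch of that cohort cut, assuming a hypothetical per-task table with a user's signup month, the task date, and the duration; holding the signup cohort fixed separates "tasks got longer" from "the user base changed as it grew":

    import pandas as pd

    # Hypothetical columns: user_id, signup_month, task_date, duration_minutes
    df = pd.read_csv("tasks.csv", parse_dates=["task_date"])
    df["task_month"] = df["task_date"].dt.to_period("M")

    # 90th-percentile task duration per signup cohort, month over month
    p90 = (
        df.groupby(["signup_month", "task_month"])["duration_minutes"]
          .quantile(0.90)
          .unstack("task_month")
    )
    print(p90)  # rows: cohorts; columns: calendar months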
I actually come away from this questioning the METR work on autonomy.
You can see the trend for other percentiles at the bottom of the file they link to in the blog post: https://cdn.sanity.io/files/4zrzovbb/website/5b4158dc1afb211...
I really hope this is a simulation example.
How autonomous are humans?
Do I need to continually correct them and provide guidance?
Do they go off track?
Do they waste time on something that doesn't matter?
Autonomous humans have the same problems.
The way Clio works, "private" just means removing first-person speech while leaving a summary of the data behind.
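Roughly what that implies, as a sketch based only on the description above (the `summarize` stand-in and every name here are assumptions, not Anthropic's actual pipeline): the raw text gets discarded, but a summary derived from user content survives and is what gets clustered:

    import re
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    def summarize(conversation: str) -> str:
        # Crude stand-in for an LLM summarizer: drop first-person
        # sentences, keep the rest.
        kept = [s for s in re.split(r"(?<=[.!?])\s+", conversation)
                if not re.match(r"(I|my|me)\b", s, re.IGNORECASE)]
        return " ".join(kept)[:200]

    conversations = [
        "I need help with my tax return. What deductions apply to freelancers?",
        "I want to plan a trip. Suggest a 3-day itinerary for Kyoto.",
        "My Python script crashes. How do I debug a KeyError?",
    ]
    summaries = [summarize(c) for c in conversations]
    del conversations  # the raw text is gone...

    # ...but the summaries, still derived from user content, persist
    # and get clustered into topic tags.
    X = TfidfVectorizer().fit_transform(summaries)
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)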
Even though the data is summarized, that still means your IP is stored by Anthropic? For me it's actually a huge data security issue (that I only figured out just now, sigh).
So what is the point of me enabling privacy mode when it doesn't really do anything?
There might be some risk of a data leak when a new cluster (tag) is defined. But that's not the same as saying they are viewing summaries of content.