> Ok - someone should tell these people that agents running for hours isn’t a measure of success then.
To quote the researchers who coined the term:
Does “time horizon” mean the length of time that current AI agents can act autonomously?
No. The 50%-time horizon is the length of task (measured by how long it takes a human expert) that an AI agent can complete with 50% reliability. It’s a measure of the difficulty of a task, rather than the time an AI spends to complete the task.
-
https://metr.org/time-horizons/

If by "these people" you mean people like you, who conflate "long time horizon" with "long wall-clock time" like you just did, then yes, that's why I replied to you.
Conversely, when a researcher says "I can leave my LLM running for hours, because it has a long time horizon", that's a statement of *causality*. Car analogy: if time horizon is fuel efficiency, then the LLM working by itself for hours at a time is like driving your car for thousands of miles. The latter can obviously be gamed by fitting a bigger fuel tank, but it also comes automatically from having a more efficient engine. Max range != engine efficiency, but more efficient engines increase range. "Long wall-clock time without intervention" != "long time horizon", but longer time horizons increase wall-clock time without intervention.
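The fuel-tank-vs-engine distinction can be made concrete with a toy model (all numbers hypothetical, not from METR): treat an autonomous run as a chain of independent subtasks, where per-subtask reliability stands in for the time horizon and wall-clock time is how long the chain runs before an unrecovered failure.

```python
def expected_autonomous_minutes(p_subtask: float,
                                subtask_minutes: float,
                                retries: int = 0) -> float:
    """Expected wall-clock minutes before a failure needs human intervention."""
    # Chance a subtask still fails after exhausting all retry attempts.
    p_fail = (1 - p_subtask) ** (retries + 1)
    # Geometric model: mean number of subtask slots until the first failure.
    return subtask_minutes / p_fail

# A more reliable model (longer horizon) runs longer unattended:
print(expected_autonomous_minutes(0.90, 5))              # 50.0
print(expected_autonomous_minutes(0.99, 5))              # 500.0
# ...but so does bolting on retries (the "bigger fuel tank"),
# with no change in underlying reliability:
print(expected_autonomous_minutes(0.90, 5, retries=2))   # ~5000
```

Both knobs increase unattended runtime, which is exactly why wall-clock time alone can't serve as the measure of capability.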
In fact, another relevant quote from the researchers who coined the term:
What does METR mean by a task? Would solving 1000 1-hour math problems in a row be a 1000-hour task?
Our tasks are meant to be coherent, self-contained units of work that can’t be trivially split into independent pieces. Therefore, solving 1000 separate 1-hour math problems isn’t a 1000-hour task; we’d consider it a 1-hour task done 1000 times. The same idea applies for searching for needles in a 10-million-word haystack. In either case, you could easily split the work across many people working in parallel (or by making many parallel AI calls), so it’s not really a “long” task in the sense we care about.
In contrast, the prototypical multi-hour task might look like iteratively debugging a complex system, where each fix reveals new problems that only make sense if you know what you already tried.
-
https://metr.org/time-horizons/

> Not sure how you’d measure software engineering tasks in an isolated manner like that. There are things I need to look up docs for, and others I don’t need to. And that depends on the person. There are tedious tasks that I sometimes get right with my first try, other times I have to look away for a minute and look back at it to get right. There is internet speed. Task evolves or architecture changes mid-task.
Are you unfamiliar with how statistics deals with such things? Even the quote I gave you in the previous comment notes that some of the humans failed to complete some of the tasks.
Also, to quote the researchers who coined the term:
Our tasks are designed to be self-contained and well-specified, so that they’re fair to both the AI agents and the humans. In contrast, most real-world work draws on prior context, such as previous conversations, tacit knowledge, or familiarity with an existing code base. We think it’s better to think of our 2-hour tasks as what someone with low or no prior context (like a new hire or freelance contractor) could complete in 2 hours, rather than someone experienced who is already familiar with the project.
-
https://metr.org/time-horizons/

> I wouldn’t consider anything well-defined and repetitively measurable a “long-time horizon task” - adding a new HTTP handler isn’t one, adding a new React route isn’t one.
First, "long" is relative, not absolute. The early models could *only* reliably help with things that take a human a few seconds, e.g. stubbing out a function. Now they're up to 1.5 hours at P(success)=80%, or 11h59m at P(success)=50%. That is what "long time horizon" means in this context: https://metr.org/time-horizons/
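Those two numbers (the 80% and 50% horizons) are two points on the same fitted success-probability curve. A minimal sketch of how a horizon falls out of a logistic model in log task length (hypothetical parameters, not METR's actual fit or code):

```python
import math

# Hypothetical logistic model of P(success) vs. task length,
# measured in human-expert minutes (illustrative parameters only).
A, B = 3.5, 1.0  # intercept and slope in log2(minutes)

def p_success(task_minutes: float) -> float:
    """Modeled probability the agent completes a task of this length."""
    return 1 / (1 + math.exp(-(A - B * math.log2(task_minutes))))

def horizon(p: float) -> float:
    """Task length at which the model succeeds with probability p."""
    logit = math.log(p / (1 - p))
    return 2 ** ((A - logit) / B)

print(horizon(0.5))  # the 50%-time horizon (~11.3 minutes in this toy fit)
print(horizon(0.8))  # the 80% horizon is necessarily shorter
```

The 80% horizon is always shorter than the 50% one, which is why the same model can show up as both "1.5 hours at 80%" and "nearly 12 hours at 50%".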
Second, the entire point of the METR study I linked is to put those tasks you're dismissive of on the same chart as both frontier and early models, in order to find out what kind of things each model can do. I suggest reading it or watching the video; both explain this point.
> Edit: Apparently there are people who care to be precise about this. See: https://subq.ai and how they describe it as "long‑context tasks."
Incorrect. "Long context" is a third thing: "long context" != "long time horizon" != "long wall-clock time without intervention".
In the car analogy, where "time horizon" maps to fuel efficiency and wall-clock time maps to range, context length maps to how good your field of view is from the driver's seat.