Ask HN: Why would we care about "extended time horizons" and LLMs?

5 pointsozozozd5d ago5 comments

Is it more impressive to take longer to answer 2 + 2? It’s not. The longer one takes, the less intelligent we would rate that person.

Somehow for AI agents taking longer is getting praise with the framing “maintaining attention for long-time horizons?”

Have we collectively gone down to room temperature IQs with COVID?

Why would the time dimension matter for a tool that is limited in context window? Doesn’t matter if you fill up the window in 1 second or 60 minutes. Also, it’s super easy to game. Insert random lags, reduce tokens/sec, there you have a model that maintains attention over “long-time horizons”

Maybe more importantly how do people in this field buy into these easily game-able non-indicators so easily? How did they not develop the instinct to instantly call out metrics like lines of code, number of tokens burned or time taken to process a task as BS the instant they hear it?

How do they benchmark their code? The longer running the better? Number of CPU cycles spent?

5 comments

ben_w5d ago

You have a common misunderstanding of what is meant by "time horizon".

This is not "how long does AI take to do ${thing}", it is "how long does *human* take to do ${thing}, where ${thing} is from the set of things that AI has probability = n of getting right", where n happens to be 50% or 80% in the METR studies.

At least, that's the short answer, here's a video with more depth: https://www.youtube.com/watch?v=evSFeqTZdqs

My experience is the AI actually completes the task in a few minutes, when it was a 2-ish hour task and the AI has a time horizon of 2 hours at P(correct) = 0.8. It is I the human, not the AI used by me, that would have taken 2 hours.

ozozozdOP5d ago

Not misunderstanding. And I had assumed what you described at first as well.

All I see now is celebration of how agents run for hours and handle “long-time horizons.”

Although the original definition is also flawed for coding. How do you estimate the time it takes to complete a coding task in hours? If we had that formula, why have we been playing estimation poker or resorting to fibonacci series for predicting software tasks? Because you can’t. It’s a made up metric.

ben_w5d ago

> Not misunderstanding.

Then why did you write "Also, it’s super easy to game. Insert random lags, reduce tokens/sec, there you have a model that maintains attention over “long-time horizons”"?

The wall-clock time the LLM spends per task isn't the metric. How long you can leave the LLM alone, wall-clock time, without intervention, isn't "long-time horizons", it's more like "I gave it a list of tasks and it worked through them". Which is neat when it works, but different.

> All I see now is celebration of how agents run for hours and handle “long-time horizons.”

Yes? And? The long time horizons is with reference *to how long it would take humans to do*. Of course this is celebrated. When I've experimented with them, quite often after finishing one task from the plan, they'll go right on to the next task. Each task may take minutes, but the plan can have hundreds of items in it, and hundreds of minute-by-the-clock tasks is indeed hours.

You're literally, on your opening sentence, complaining about 2 + 2 taking longer to solve, this isn't even close to the point of the "time horizons" metric.

> How do you estimate the time it takes to complete a coding task in hours? If we had that formula, why have we been playing estimation poker or resorting to fibonacci series for predicting software tasks? Because you can’t. It’s a made up metric.

Mostly it wasn't estimated, but rather *measured*:

  2.2 Baselining

  In order to ground AI agent performance, we also measure the performance of multiple human “baseliners” on most tasks and recorded the duration of their attempts. In total, we use over 800 baselines totaling 2,529 hours, of which 558 baselines (286 successful) come from HCAST and RE-Bench, and 249 (236 successful) from the shorter SWAA tasks. 148 of the 169 tasks have human baselines, but we rely on researcher estimates for 21 tasks in HCAST.

  Our baseliners are skilled professionals in software engineering, machine learning, and cybersecurity, with the majority having attended world top-100 universities. They have an average of about 5 years of relevant experience, with software engineering baseliners having more experience than ML or cybersecurity baseliners. For more details about baselines, see Appendix C.1.

- https://arxiv.org/html/2503.14499v3

As with all the other metrics, this is now basically saturated, as nobody seems to want to pay METR $4M to hire a statistically significant number of engineers to spend 4h-1w on each of another 800 baselines for longer tasks. Or if they are, it's being kept very quiet.

ozozozdOP4d ago

Ok - someone should tell these people that agents running for hours isn’t a measure of success then.

Not sure how you’d measure software engineering tasks in an isolated manner like that. There are things I need to look up docs for, and others I don’t need to. And that depends on the person. There are tedious tasks that I sometimes get right with my first try, other times I have to look away for a minute and look back at it to get right. There is internet speed. Task evolves or architecture changes mid-task.

I wouldn’t consider anything well-defined and repetitively measurable a “long-time horizon task” - adding a new HTTP handler isn’t one, adding a new React route isn’t one.

Edit: Apparently there are people who care to be precise about this. See: https://subq.ai and how they describe it as "long‑context tasks."

1 more reply

j / k navigate · click thread line to collapse

5 comments

ben_w5d ago

You have a common misunderstanding of what is meant by "time horizon".

At least, that's the short answer, here's a video with more depth: https://www.youtube.com/watch?v=evSFeqTZdqs

ozozozdOP5d ago

Not misunderstanding. And I had assumed what you described at first as well.

All I see now is celebration of how agents run for hours and handle “long-time horizons.”

ben_w5d ago

> Not misunderstanding.

Then why did you write "Also, it’s super easy to game. Insert random lags, reduce tokens/sec, there you have a model that maintains attention over “long-time horizons”"?

> All I see now is celebration of how agents run for hours and handle “long-time horizons.”

You're literally, on your opening sentence, complaining about 2 + 2 taking longer to solve, this isn't even close to the point of the "time horizons" metric.

Mostly it wasn't estimated, but rather *measured*:

  2.2 Baselining

  In order to ground AI agent performance, we also measure the performance of multiple human “baseliners” on most tasks and recorded the duration of their attempts. In total, we use over 800 baselines totaling 2,529 hours, of which 558 baselines (286 successful) come from HCAST and RE-Bench, and 249 (236 successful) from the shorter SWAA tasks. 148 of the 169 tasks have human baselines, but we rely on researcher estimates for 21 tasks in HCAST.

  Our baseliners are skilled professionals in software engineering, machine learning, and cybersecurity, with the majority having attended world top-100 universities. They have an average of about 5 years of relevant experience, with software engineering baseliners having more experience than ML or cybersecurity baseliners. For more details about baselines, see Appendix C.1.

- https://arxiv.org/html/2503.14499v3

ozozozdOP4d ago

Ok - someone should tell these people that agents running for hours isn’t a measure of success then.

I wouldn’t consider anything well-defined and repetitively measurable a “long-time horizon task” - adding a new HTTP handler isn’t one, adding a new React route isn’t one.

Edit: Apparently there are people who care to be precise about this. See: https://subq.ai and how they describe it as "long‑context tasks."

1 more reply

j / k navigate · click thread line to collapse