According to the authors, the harness isn't ARC-AGI specific, though: https://x.com/agenticasdk/status/2037335806264971461
As other comments point out, this is about harness development and harness efficiency. Agentica SDK is a sort of meta-harness that makes one thing easy: plugging any "internal API" (as defined natively in your codebase) directly into your agent. Agentica SDK itself is not application specific; but the APIs of your application are... application specific.
Re: the linked prompt. A harness is a set of tools and descriptions of how to best use those tools, and sometimes some external control flow based on the outcome of using those tools. How to "best use the tools" should always be part of the prompt (like in this case).
So this work tries to answer: "short of telling the agent any solutions, make a simple but efficient API to play the games, hand it to the agent, and see how it does". In the world of harness development I think that's an interesting question to answer!
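To make the definition of a harness above concrete (tools, descriptions of how to use them, and some external control flow), here is a minimal sketch in Python. It is not the Agentica SDK API and not the authors' code; `Tool`, `build_prompt`, `run_harness`, `CounterGame` and `scripted_policy` are all hypothetical names, and the "agent" is a scripted stand-in for an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Tool:
    name: str
    description: str           # "how to best use the tool"; ends up in the prompt
    fn: Callable[[int], str]   # the application-specific API being exposed


def build_prompt(tools: List[Tool]) -> str:
    """Render the tool list and usage hints into the prompt text."""
    lines = ["You can call these tools:"]
    lines += [f"- {t.name}: {t.description}" for t in tools]
    return "\n".join(lines)


def run_harness(tools: List[Tool], policy, max_steps: int = 10) -> List[Tuple[str, int, str]]:
    """Ask the policy for a tool call, run it, and stop on a 'done' result."""
    registry = {t.name: t for t in tools}
    prompt = build_prompt(tools)
    history: List[Tuple[str, int, str]] = []
    for _ in range(max_steps):
        name, arg = policy(prompt, history)
        result = registry[name].fn(arg)
        history.append((name, arg, result))
        if result == "done":   # external control flow based on the tool outcome
            break
    return history


class CounterGame:
    """Toy stand-in for a game environment (assumption, not an ARC-AGI-3 game)."""

    def __init__(self, target: int) -> None:
        self.target = target
        self.value = 0

    def step(self, delta: int) -> str:
        self.value += delta
        return "done" if self.value == self.target else f"value={self.value}"


def scripted_policy(prompt: str, history: list) -> Tuple[str, int]:
    """Placeholder for the model: always calls `step` with 1."""
    return ("step", 1)


game = CounterGame(target=3)
tools = [Tool("step", "Add an integer to the counter; stop once it reports done.", game.step)]
history = run_harness(tools, scripted_policy)
```

Note that the harness itself knows nothing about the game; only the `Tool` wrapping `game.step` and its description are application specific.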
The hard part of these tests isn't purely reasoning ability ffs.
This seems like an arbitrary restriction. Tool-use requires a harness, and their whitepaper never defines exactly what counts as valid.
EDIT from https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf:
> We seek to fight two forms of overfitting that would muddy public sensefinding:
> Task-specific overfitting. This includes any agent that is created with knowledge of public ARC-AGI-3 environments, subsequently being evaluated on the same environments. It could be either directly trained on these environments, or using a harness that is handcrafted or specifically configured by someone with knowledge of the public environments.
What if you give Opus the same harness? Do people even care about meaningful comparisons any more, or is it all just “numbers go up”?
Yes, it's unfair to compare results for the 25 (easier) public games against scores for the 55 semi-private games (scores for which are taken from https://arcprize.org/leaderboard).
But you're wrong to say that a custom harness invalidates the result. Yes, the official "ARC verified" scoreboard for frontier LLMs requires (https://arcprize.org/policy):
> using extremely generic and minimal LLM testing prompts, no client-side "harnesses", no hand-crafted tools, and no tailored model configuration
but these are limitations placed in order to compare LLMs from frontier labs on equal footing, not limitations that apply to submissions in general. It's not as if a solution to ARC-AGI-3 must involve training a custom LLM! This Agentica harness is a completely legitimate approach to ARC-AGI-3, similar to J. Berman's for ARC-AGI-1/2, for example.
I wish we'd move past public test sets for LLM benchmarks: publish a plain English explanation of the tasks, allow questions and clarifications, but never release a single question from the test set verbatim.
It made sense back when models needed to be finetuned on the task to even reliably answer. If we're saying this is the path to AGI we should be able to rely on the generalization of the model to get it right.
This is the state of "AI" these days I guess...
[1] https://github.com/symbolica-ai/ARC-AGI-3-Agents/blob/symbol...
> Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.