I used autoresearch to improve my AGENTS.md, measured against real tasks

(stet.sh)

8 points · bisonbear · 25d ago · 7 comments

7 comments · 3 top-level

fuckinpuppers · 25d ago · 2 in thread

I had a blast having all the major models figure out the optimal strategy for themselves inside of Cursor, with cursorrules, AGENTS.md, .cursor/rules/ mdc files or whatever, and learned some interesting things. For example, a model won’t reliably follow every instruction even when it’s told to.

Seems like the progressive disclosure approach is the best for context efficiency; I wound up with a somewhat tight generic AGENTS.md, and the .cursor/rules individual files with glob matching for file names. Cursor honored those well.

I must have spent a couple hundred on the company dime having the models rephrase/rewrite instructions or change where they lived, and decide what made sense as a skill vs a rule, all while trying to keep things as portable as possible. At this point the Cursor-specific files would need to be ported if I moved to a different agent/framework, but the content should be pretty solid.

It was an interesting (and productive) exploration for me

bisonbear (OP) · 24d ago

> Seems like the progressive disclosure approach is the best for context efficiency; I wound up with a somewhat tight generic AGENTS.md, and the .cursor/rules individual files with glob matching for file names. Cursor honored those well.

This is also generally where I've landed - keep the AGENTS.md super light, and link out to docs as needed. Same idea with skills as well. Basically, preserve the context window at all costs.

The part I'm curious about is, when we're making the sorts of behavior changes you're describing on shared repos, how do we actually measure and quantify impact? It's one thing to tell the team that the agent should perform better, and it's another to say that you made the agent 5% better across a variety of tasks for every dev in the repo.

fuckinpuppers · 24d ago

I didn’t have to share it or quantify it… so I didn’t care.

I just relied on different agents/models and kept giving them a thorough prompt along the lines of “analyze the agents.md, cursorrules, etc and ensure its token efficient and enforces everything” (it was very specifically worded; I may have even asked an agent how to ask agents for it). I kept jumping between the 3 big models at medium and high thinking. Each one kept finding little things, and at one point the setup moved entirely from one strategy to another, if I remember right.

Once I felt good enough about it, I started using it as my setup for my application, and it’s been pretty good without any modifications or tweaks. Originally I decided to do this because I got tired of realizing that it wasn’t honoring things I told it to do. For example, I’d tell it “restart the application after every modification to the server code” and it would often “forget” to do that… somehow now I’ve got it really well tuned for my particular codebase and approach to developing.
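
A rule like the restart one can be checked mechanically rather than by feel. The following is only a minimal sketch, assuming the agent's actions are available as an ordered list of labels; the log format and the labels `edit_server`, `edit_client`, and `restart` are hypothetical and not something Cursor or any agent emits in this form.

```python
def restart_rule_violations(actions):
    """Count server edits that were not followed by a restart
    before the next server edit or the end of the log."""
    violations = 0
    pending = False  # True while a server edit is still waiting for a restart
    for kind in actions:
        if kind == "edit_server":
            if pending:
                violations += 1  # previous server edit never got its restart
            pending = True
        elif kind == "restart":
            pending = False
    if pending:
        violations += 1  # log ended with an unrestarted server edit
    return violations


# Hypothetical action log for one agent session.
log = ["edit_server", "restart", "edit_server", "edit_client", "edit_server", "restart"]
print(restart_rule_violations(log))
```

Counting violations per session before and after a rules change gives a number to compare, though it only covers rules that leave an observable trace.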

joshka · 25d ago · 1 in thread

If you look at the 95% CI on https://marginlab.ai/trackers/codex/ with N=50, it's still pretty huge (+/- 13-14% usually). I suspect it would be difficult to reasonably get a measure that numerically assesses whether an AGENTS.md is good. What you can observe, though, is whether the model paid attention to certain rules while editing, i.e. did the behavior you're steering away from or towards take place.

The hardest thing, I think, is judging whether your AGENTS.md is still good after each model release. OpenAI does release prompting guidance to help with this, however (and has added a skill to apply it to your prompts, IIRC).
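
For a sense of where an interval of roughly +/- 13-14% at N=50 comes from, here is a small sketch using the normal-approximation (Wald) interval for a pass rate. This is an assumption for illustration: the tracker linked above may compute its interval differently, and `ci_half_width` is a name made up here.

```python
import math


def ci_half_width(pass_rate, n, z=1.96):
    """Half-width of a normal-approximation confidence interval
    for a pass rate measured on n independent tasks (z=1.96 ~ 95%)."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n)


# Illustrative pass rates, not measured values.
for p in (0.5, 0.6, 0.7):
    print(f"pass rate {p:.0%}, N=50: +/- {ci_half_width(p, 50):.1%}")
```

The half-width shrinks only with the square root of N, so halving the interval takes about four times as many tasks.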

bisonbear (OP) · 24d ago

Yes, agree that low n makes overclaiming a real risk with this sort of optimization loop. Low-n results can be useful directionally, but you can't claim superiority without expanding the dataset. If I were running this for a shared repo with real consequences / value to improving AGENTS.md, instead of just as an experiment, I would expand n several times over for training / holdout, depending on expected variation on the tasks.

I'm also noticing similar patterns with needing to update AGENTS.md / skills per model release. E.g., with Opus 4.6 -> 4.7, it became much more instruction adherent, so instructions written for the prior model generation might cause unexpected behavior in the new generation. I'm also convinced that an optimal AGENTS.md for Codex is not the same file as an optimal CLAUDE.md for Claude - the model personalities and behaviors are so different that we probably need to tune the instructions differently as well.
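
As a rough guide to how far n would have to grow, this sketch inverts the normal-approximation interval to get the number of tasks needed for a target half-width on a single pass rate. It assumes independent tasks and a worst-case pass rate of 50% by default; `tasks_needed` is a hypothetical helper, not part of any eval tool.

```python
import math


def tasks_needed(half_width, pass_rate=0.5, z=1.96):
    """Smallest n whose normal-approximation interval is no wider
    than +/- half_width around the given pass rate."""
    return math.ceil(z**2 * pass_rate * (1 - pass_rate) / half_width**2)


# Target half-widths of 10 and 5 percentage points.
for hw in (0.10, 0.05):
    print(f"+/- {hw:.0%}: about {tasks_needed(hw)} tasks")
```

Comparing two AGENTS.md variants is a difference of two noisy pass rates, so resolving a gap of a given size generally takes more tasks than this single-rate figure suggests.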

jauntywundrkind · 25d ago · 1 in thread

The fine-tuning where we run tests/experiments again and again and again on our prompts and set-ups: I'm really looking forward to when we can start to compare our amalgamated rigs and harnesses and prompts, all these systems. We are guided by intuition, by a desire for the structure & clarity & direction we think we add. But we lack common tools to assess and compare.

And even when we do compare, the thermal values, the entropy of our systems: that alone can lead us down very different paths. Even when all the rigging is controlled. (Which implies we need multiple experiments to compare against.)

bisonbear (OP) · 24d ago

> we lack common tools to assess and compare

This has been bothering me for a while - the entire dev community is running on vibes when talking about AI. We're operating in an old paradigm, thinking that smart and logical additions to AGENTS.md result in good agent behavior, when in fact agent behavior is such a black box that measurement is necessary.

> Even when all the rigging is controlled. (Which implies we need multiple experiments to compare against.)

Even the rigging is hard to control - Anthropic has an interesting piece on this here https://www.anthropic.com/engineering/infrastructure-noise
