Seems like the progressive disclosure approach is the best for context efficiency; I wound up with a fairly tight, generic AGENTS.md plus individual .cursor/rules files that use glob matching on file names. Cursor honored those well.
I must have spent a couple hundred on the company dime having the models rephrase/rewrite instructions, move them around, and sort out what made sense as a skill vs. a rule, all while trying to keep things as portable as possible. At this point the Cursor-specific files would need to be ported if we ever moved to a different agent/framework, but the content should be pretty solid.
It was an interesting (and productive) exploration for me.
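For anyone who hasn't seen the pattern, here's a minimal Python sketch of the idea: only load a rule file when a touched path matches its glob. This is not how Cursor implements matching; the rule table, patterns, and file names are all made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical rule table: glob pattern -> rule file to pull into context.
RULES = {
    "server/*.py": "rules/server.md",
    "*.sql": "rules/migrations.md",
    "web/*.tsx": "rules/frontend.md",
}

def rules_for(paths):
    """Return the rule files whose glob matches at least one touched path."""
    selected = []
    for pattern, rule_file in RULES.items():
        # Note: fnmatch's "*" also matches "/", so "server/*.py" covers subfolders.
        if any(fnmatchcase(path, pattern) for path in paths):
            selected.append(rule_file)
    return selected

print(rules_for(["server/api/routes.py", "db/001_init.sql"]))
# prints ['rules/server.md', 'rules/migrations.md']
```

The frontend rules never enter the context here because nothing under `web/` was touched, which is the whole point of the approach.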
This is also generally where I've landed - keep the AGENTS.md super light, and link out to docs as needed. Same idea with skills as well. Basically, preserve the context window at all costs.
The part I'm curious about is, when we're making the sorts of behavior changes you're describing on shared repos, how do we actually measure and quantify impact? It's one thing to tell the team that the agent should perform better, and it's another to say that you made the agent 5% better across a variety of tasks for every dev in the repo.
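One way I could imagine getting at a number like that, as a rough sketch rather than a recommended methodology: run the same task set with the old and new instructions, record pass/fail per task, and put a bootstrap confidence interval on the difference in pass rate. The outcome lists below are placeholder data, not real results, and the function names are my own.

```python
import random

def pass_rate(outcomes):
    """Fraction of tasks that passed (outcomes are 1 for pass, 0 for fail)."""
    return sum(outcomes) / len(outcomes)

def bootstrap_diff_ci(baseline, candidate, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for (candidate - baseline) pass rate.

    Tasks are resampled as pairs, so both lists must cover the same tasks
    in the same order.
    """
    assert len(baseline) == len(candidate)
    rng = random.Random(seed)
    n = len(baseline)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(
            pass_rate([candidate[i] for i in idx])
            - pass_rate([baseline[i] for i in idx])
        )
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Placeholder outcomes for 10 hypothetical tasks (1 = pass, 0 = fail).
baseline = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]   # old AGENTS.md
candidate = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1]  # new AGENTS.md

print("observed diff:", pass_rate(candidate) - pass_rate(baseline))
print("95% CI:", bootstrap_diff_ci(baseline, candidate))
```

If the interval comfortably includes zero, the honest statement to the team is "no measurable change yet", and with only a handful of tasks that will usually be the case.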
I just relied on different agents/models and kept giving them a thorough prompt along the lines of “analyze the agents.md, cursorrules, etc. and ensure it's token efficient and enforces everything” (it was very specifically worded; I may have even asked an agent how to ask agents for it). I kept jumping between the 3 big models at medium and high thinking. Each one kept finding little things, and at one point one of them moved entirely from one strategy to another, if I remember right.
Once I felt good enough about it, I started using it as the setup for my application, and it’s been pretty good without any modifications or tweaks. Originally I decided to do this because I got tired of realizing that it wasn’t honoring things I told it to do. For example, I'd tell it “restart the application after every modification to the server code” and it would “forget” to do that often… somehow now I’ve got it really well tuned for my particular codebase and approach to developing.
The hardest thing, I think, is judging whether your AGENTS.md is still good after each model release. OpenAI does release prompting guidance to help with this, though (and has added a skill to apply it to your prompts, IIRC).
I'm also noticing similar patterns with needing to update AGENTS.md / skills per model release. E.g., with Opus 4.6 -> 4.7, the model became much more instruction-adherent, so instructions written for the prior model generation might cause unexpected behavior in the new generation. I'm also convinced that an optimal AGENTS.md for Codex is not the same file as an optimal CLAUDE.md for Claude - the model personalities and behaviors are so different that we probably need to tune the instructions differently as well.
And even when we do compare, the thermal values, the entropy of our systems: that alone can lead us down very different paths. Even when all the rigging is controlled. (Which implies we need multiple experiments to compare against.)
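To make the "multiple experiments" point concrete, here is a crude sketch: repeat each configuration several times and only trust a gap between them if it is larger than the run-to-run spread. The scores are invented, the `k=2.0` threshold is an arbitrary assumption, and this is a sanity heuristic, not a proper significance test.

```python
from statistics import mean, stdev

def summarize(runs):
    """Mean and sample standard deviation of repeated-run scores."""
    return mean(runs), stdev(runs)

def gap_exceeds_noise(runs_a, runs_b, k=2.0):
    """True if the gap in mean score is bigger than k times the larger
    run-to-run standard deviation of the two configurations."""
    mean_a, sd_a = summarize(runs_a)
    mean_b, sd_b = summarize(runs_b)
    return abs(mean_b - mean_a) > k * max(sd_a, sd_b)

# Invented pass rates from 5 repeated runs of each configuration.
runs_a = [0.60, 0.64, 0.58, 0.62, 0.61]  # same setup, rerun 5 times
runs_b = [0.63, 0.59, 0.65, 0.61, 0.62]  # tweaked instructions, rerun 5 times

print(summarize(runs_a))
print(summarize(runs_b))
print(gap_exceeds_noise(runs_a, runs_b))  # prints False: the gap is within the noise
```

A single run of each would have shown anything from a clear win to a clear loss, depending on which pair of runs you happened to get.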
This has been bothering me for a while - the entire dev community is running on vibes when talking about AI. We're operating in an old paradigm, thinking that smart and logical additions to AGENTS.md result in good agent behavior, when in fact agent behavior is such a black box that measurement is necessary.
> Even when all the rigging is controlled. (Which implies we need multiple experiments to compare against.)
Even the rigging is hard to control - Anthropic has an interesting piece on this here https://www.anthropic.com/engineering/infrastructure-noise