I made a crude first stab at an approach that at least uses similar steps and structure to compare the effectiveness of AI agents. My approach was used on a small toy problem, but one that was complex enough the agents couldn't one-shot and required error correction.
It was enough to show significant differences, but scaling this to larger projects and multiple runs would be pretty difficult.
https://mattwigdahl.substack.com/p/claude-code-vs-codex-cli-...