> This raises a central question: do such tests meaningfully improve issue resolution, or do they mainly mimic a familiar software-development practice while consuming interaction budget?
This is an important question, but it's not the one I care most about when I require agents to follow TDD. My goal is to lock in behavior: far too often an agent would successfully fix the issue at hand but break something else it wasn't supposed to touch.
The tests add another layer of protection, which is why I always use separate red and green worker subagents. The green worker might get trigger-happy and go beyond scope or break something, but it isn't allowed to fudge the tests, so I'll know when that happens and can clean up and revert.
It's also why I'm not too bothered about perfect red-green TDD. I can add the tests later if needed.
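
To make the "green worker can't fudge the tests" rule concrete, here is a minimal sketch of how that check could be enforced mechanically. It is not my actual setup: the function name `protected_changes` and the naming conventions it treats as test files (a `tests` directory, `test_*.py`, `*_test.py`, `conftest.py`) are assumptions for illustration.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical conventions for what counts as a test file; adjust per repo.
TEST_NAME_PATTERNS = ("test_*.py", "*_test.py", "conftest.py")
TEST_DIR_NAMES = ("tests",)


def is_test_file(path):
    """Return True if the path looks like a test file under the assumed conventions."""
    p = PurePosixPath(path)
    # Any directory component named "tests" marks the file as protected.
    in_test_dir = any(part in TEST_DIR_NAMES for part in p.parts[:-1])
    # Otherwise fall back to matching the file name itself.
    name_matches = any(fnmatchcase(p.name, pat) for pat in TEST_NAME_PATTERNS)
    return in_test_dir or name_matches


def protected_changes(changed_paths):
    """Return the subset of changed paths that the green worker must not touch."""
    return [path for path in changed_paths if is_test_file(path)]


# Example: paths the green worker modified (illustrative data).
changed = ["src/app/billing.py", "tests/test_billing.py"]
violations = protected_changes(changed)
if violations:
    print("green worker touched protected files:", violations)
```

The list of changed paths could come from something like `git diff --name-only` against the commit the red worker left behind. If `protected_changes` returns anything, the green worker's change gets rejected or reverted rather than merged.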