Agents just can't currently do that well. When you run into a problem when evolving the code to add a new feature or fix a bug, you need to decide whether the change belongs in the architecture or should be done locally. Agents are about as good as a random choice in picking the right answer, and there's typically only one right answer. They simply don't have the judgment. Sometimes you get the wrong choice in one session and the right choice in another.
But this happens at all levels because there are many more than just two abstraction levels. E.g. do I change a subroutine's signature or do I change the callsite? Agents get it wrong. A lot.
Another thing they just don't get (because they're so focused on task success) is that it's very often better to let things go wrong in a way that could inform changes rather than get things to "work" in a way that hides the problem. One of the reasons agent code needs to be reviewed even more carefully than human code is that they're really good at hiding issues with potentially catastrophic consequences.