You will find that the agent is much more successful in this regard.
The LLM has certain intrinsic abilities that match ours, and like us, it cannot write 10,000 lines of code and have everything work in one go. It does better when you develop incrementally and verify each increment. The smaller the increments, the better it performs.
Unfortunately, the chain-of-thought process doesn’t really do this. It can come up with steps, but sometimes the steps are too big, and it almost never properly verifies that things are working after each increment. That’s why you have to put yourself in the loop here.
What’s missing here is letting the computer run tests and verify that the application works as expected at each step, and even decide what verification means. Although this part isn’t automated yet, I think it can easily be automated, with humans becoming less and less involved and moving into a more and more supervisory role.
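To make the idea of small, verified increments concrete, here is a minimal sketch of what such a loop might look like. Everything in it is an assumption for illustration: the `Increment` class, the `run_increments` function, and the dictionary standing in for the application are hypothetical names, not part of any agent framework. In a real project, `verify` would more likely run a test suite than check a dictionary key.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Increment:
    """One small step of work (hypothetical structure, for illustration only)."""
    name: str
    apply: Callable[[dict], None]   # makes one small change to the application state
    verify: Callable[[dict], bool]  # checks that the change actually works


def run_increments(state: dict, increments: List[Increment]) -> List[str]:
    """Apply increments one at a time and stop at the first failed verification."""
    completed = []
    for inc in increments:
        inc.apply(state)
        if not inc.verify(state):
            # Stop here so a human (or the agent) can fix this step
            # before more changes pile up on top of a broken one.
            break
        completed.append(inc.name)
    return completed


# Toy example: the second step fails verification, so the third never runs.
steps = [
    Increment("add parser",
              lambda s: s.update(parser=True),
              lambda s: s.get("parser") is True),
    Increment("add broken feature",
              lambda s: s.update(feature=False),
              lambda s: s.get("feature") is True),
    Increment("add ui",
              lambda s: s.update(ui=True),
              lambda s: s.get("ui") is True),
]

print(run_increments({}, steps))  # prints ['add parser']
```

The point of the sketch is the early stop: a failure is caught right after the increment that caused it, which is the check the agent tends to skip on its own and the place where you currently have to step in.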