So yes, you might get good results in one round, but not over time. What does work is to carefully review the AI's output, although the review needs to be more careful than review of human-written code because the agents are very good at hiding the time bombs they leave behind.
If I instruct the AI to make small modules where I can verify they work, have tests and no side effects - then it is good enough code for me. It works, is readable and can be extended - and will turn into bad code if this is not done with care.
The main thing that helps me in my workflow is to develop documentation around the code. If the code drifts from the docs, the model will notice and you can decide which was correct, the plan, the maintainer manual, or the code, or the comments in the code. Notice that there is 3 separate things written about the code, and the code itself…. Keeping all of that correct, coherent, and consistent (with a separate, invariant document that describes your design principles) keeps the model from going off the rails and gives ample opportunity to sense bad smells before they get set in stone.
It’s a token fire and you need a minimum 250k context model… but I still get as much work done in an hour as I used to do in a day, and the code I coauthor is better documented, more maintainable, and more tested than any code I have ever written before.
Some of these checks have caught thousands of the same error, even with the latest Opus 4.7 writing the original code.
You can’t play whack-a-mole with GenAI. You have to start from well-known principles and watch everything it produces. Every module or bounded context has to have its own invariants.
You can’t fully automate software engineering with GenAI. It seems the vast majority of GenAI users think they can and end up in the same place as the OP.
Maybe learn Domain-Driven Design, Event Sourcing, and then try again. The results will be dramatically improved.
The harder the code is to understand, the badder it is (and the more likely it is infested with bugs).
Code that will not be able to evolve for more than one-two years is terrible code. Agents write terrible code while doing a truly impressive job hiding it (including in the tests they write) unless, of course, you keep them under very close supervision.
"Do x" - for baseline, assume this generally does X fine.
"Do X, don't use javascript". - even if X already didn't use javascript, this will often perform worse. It will perform even _more_ worse if X is difficult or unusual to do without javascript even if there is some perfectly serviceable way to do it.
Also, despite "don't use javascript", sometimes it just still uses a little bit of javascript anyway, and usually in a spot that would actually be extremely annoying/inconvenient to them remove that js yourself (when you would've otherwise reconsidered your approach at a higher level, to either use js, or to just want something different that is easier to do without js).
I do observe the same thing. There are a limited number of constraints you can add and once you exceed that, you'll play whack-a-mole if you insist on all of them.
This is why I tend toward a more wu-wei attitude to constraints.
For example:
- Do I really need this constraint?
- How does the agent tend to behave in this scenario it if unconstrained? Is this behavior/result an acceptable pattern for this solution?
- Is the constraint implicitly followed often enough that I can trade spending tokens recovering from a deterministic test that enforces the constraint rather than preemptively state it in the prompt?
If I get into the situation where I need more constraints than can fit in context/attention without the need to regularly play whack-a-mole, then I break the module down into sub-modules with fewer, more specific constraints.
Same way teaching your child to do something is much harder than to just do it yourself.
Except the child learns.
Countless individuals/companies unfortunately do believe it is a replacement still and are unlikely to change their minds.