I am not sure how others are doing this, but here is our process:
- meaningful test coverage
- internal software architecture was explicitly baked into the prompts, and we try to not go wild with vibing, but, rather, spec it well, and keep Claude on a short leash
- each feature built was followed by a round of refactoring (with Claude, but with an oversight of an opinionated human). we spend 50% building, 50% refactoring, at least. Sometimes it feels like 30/70%. Code quality matters to us, as those codebases are large and not doing this leads to very noticeable drop in Claude's perceived 'intelligence'.
- performance tests as per usual - designed by our infra engineers, not vibed
- static code analysis, and a hierarchical system of guardrails (small claude.md + lots of files referenced there for various purposes). Not quite fond of how that works, Claude has been always very keen to ignore instructions and go his own way (see: "short leash, refactor often").
- pentests with regular human beings
The one project I mentioned - 2 months for a complete rewrite - was about a week of working on the code and almost 2 months spent on reviews, tests, and of course some of that time was wasted as we were doing this for the first time for such a large codebase. The rewritten app is doing fine in production for a while now.
I can only compare the outputs to the quality of the outputs of our regular engineering teams. It compares fine vs. good dev teams, IMHO.