It doesn't stop them from making stupid mistakes. It does reduce the amount of time
I have to spend dealing with the stupid mistakes they already know how to fix once the problem is pointed out to them, so I can concentrate my review on tighter diffs of cleaner code.
A real example: the tooling I mentioned at one point early on made the correct functional change, but it's written in Ruby, and Ruby allows defining the same method multiple times in the same class - the later definition silently overrides the earlier one. This would of course be a compile-time error in most other languages. It's a weakness of using Ruby with a careless (or mindless) developer...
But Rubocop - a linter - will catch it. So forcing all changes through Rubocop and just feeding the errors back to the LLM made it recognise the mistake and delete the old method.
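To make the failure mode concrete, here's a minimal sketch of the Ruby behaviour in question (class and method names are made up for illustration): the second definition silently wins, and plain Ruby raises no error.

```ruby
# Redefining a method in the same class is legal Ruby: the later
# definition silently replaces the earlier one.
class Invoice
  def total
    100
  end

  # An LLM adding this "new" version without deleting the old one
  # gets no complaint from the interpreter itself.
  def total
    100 + tax
  end

  def tax
    20
  end
end

puts Invoice.new.total # the later definition wins: prints 120
```

Rubocop, by contrast, flags exactly this pattern as a lint offence, which is what turns a silent override into an error message the LLM can act on.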
It lowers the cognitive load of the review. Instead of having to wade through and resolve a lot of cruft and make sense of unusually structured code, you can focus on the actual specific changes and subject those to more scrutiny.
And then my plan is to experiment with more semantic checks in the same style as what Rubocop uses, but less prescriptive - checks of the type "maybe you should pay extra attention here, and explain why this is correct/safe". For example, any change that involves reading a key, password field, or card number could trigger this whether or not there is actually a problem with it, both prompting the LLM to "look twice" and flagging it as an area deserving extra scrutiny in a human review.
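The shape of such a check could be very simple. This is a standalone sketch, not a real Rubocop cop - the field names and message wording are hypothetical - but it shows the "flag for attention, don't judge" idea: it emits a note for every line touching a sensitive-looking identifier, regardless of whether the code is wrong.

```ruby
# Hypothetical "pay extra attention" check: scan source text for reads
# of secret-bearing fields and emit advisory notes rather than errors.
SENSITIVE = /\b(password|api_key|secret|card_number)\b/

def attention_notes(source)
  source.each_line.with_index(1).filter_map do |line, lineno|
    next unless line.match?(SENSITIVE)

    "line #{lineno}: touches #{line[SENSITIVE]} - look twice and " \
      "explain why this is correct/safe"
  end
end
```

The notes would go both back to the LLM (as a prompt to re-examine its own diff) and into the review summary for the human. A production version would more likely live as a custom Rubocop cop operating on the AST rather than on raw text.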
It doesn't need to be perfect; it just needs to provide enough of a harness to make it easier for the humans in the loop to spot the remaining issues.