Seems like the definition of success could skew things in a couple of ways. An engineer might use Claude to evaluate an approach and conclude it's not worth pursuing for technical reasons. There'd be no commit then, so the session likely wouldn't count as a verified success, even though that was the right call.
And isn't "tests passing" a weaker signal than it looks? If I add new functionality without adding tests, the existing tests can all pass, but that doesn't tell you whether my new code actually works.
To be fair, this is acknowledged near the end.
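
For what it's worth, here's a toy sketch of the tests-passing point. Everything in it (`add`, `subtract`, `TestAdd`, `run_suite`) is made up for illustration and has nothing to do with how the sessions were actually scored. The suite comes back green even though the newly added function is wrong, because nothing exercises it.

```python
import io
import unittest


def add(a, b):
    # Existing, tested functionality.
    return a + b


def subtract(a, b):
    # Hypothetical new functionality with a deliberate bug:
    # it adds instead of subtracting, and no test covers it.
    return a + b


class TestAdd(unittest.TestCase):
    # The only test in the suite; it predates subtract().
    def test_add(self):
        self.assertEqual(add(2, 3), 5)


def run_suite():
    # Run the existing tests and report whether they all passed.
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestAdd)
    runner = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0)
    return runner.run(suite).wasSuccessful()


if __name__ == "__main__":
    print("existing tests pass:", run_suite())
    print("subtract(5, 3) =", subtract(5, 3))
```

So a "tests pass" check on its own would count this as a success. Something like a coverage check on the changed lines would be a stronger signal, though that's my suggestion and not something the write-up claims to do.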