I think that if you read the article linked, you'd find that the study didn't work that way. Any study that measures different things between experiment and control groups is not going to be sound.
Then it gets to the field, and whatever you didn't catch becomes a 3-alarm fire.
I've worked on TDD projects, and those are definitely counted as bugs. Severity 1.