I'd hazard to guess that the 95% is the reduction in how many parts made it through the first test only to be caught later at the more expensive stage. So instead of binning 100 parts a month at that second stage, they now bin 5 parts a month and catch way more early on.
It sounds like the OP is using ML to identify flaws that simply just occur due to imperfections in the manufacturing process. That's life, it happens. You can know that they will occur without necessarily being able to prevent them because maybe there's some dust or other particulates in the air that deposit into the resin occasionally, or maybe the resin begins to cure and leaves hard spots that form bond flaws. There's heaps of possible reasons. It sounds more like the ML is doing classification of 'this too much of a flaw in a local zone' vs 'this has some flaws but it's still good enough to pass', which is how we operate with casting defects. For example, castings have these things called SCRATA comparitor plates, where you literally look at an 'example' tactile plate, look at your cast item, then mentally decide on a purely qualtative aspect which grade of plate it matches. Here [1] are some bad black and white photos of the plates.
[1] http://www.iron-foundry.com/ASTM-A802-steel-castings-surface...