This particular board was a test escape. A highly unusual, thankfully singular (so far as I know) test escape, but there's always a chance of one -- the larger failure in my mind was the breakdown in the support process. With our new controls in place (including rapid escalation / RMA of "odd" faults observed by customers) I am confident this won't happen again.
And yes, we're updating tests to catch the newly discovered failure mode. It's an interesting one: the board is a sort of "zombie" in that if you boot it up by looking at it just right, it'll run stably and pass our stress tests. But given its other faults, it should never have left our factory in that condition. Full stop. :)