The paper "Making Contextual Decisions with Low Technical Debt" (https://arxiv.org/pdf/1606.03966.pdf) goes deeper. Testing and monitoring deployments are very similar activities. In my experience, shadow testing new models (seeing how their outputs would differ from the production model's on real data) has been very important for identifying issues.
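A minimal sketch of the shadow-testing idea, assuming nothing beyond the standard library: a hypothetical `ShadowPredictor` wrapper serves the production model's output while logging the candidate model's output on the same live input for later comparison. The class and method names are illustrative, not from the paper.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("shadow")

class ShadowPredictor:
    """Serve the production model; log the candidate's output for comparison."""

    def __init__(self,
                 production: Callable[[Any], Any],
                 candidate: Callable[[Any], Any]):
        self.production = production
        self.candidate = candidate

    def predict(self, features: Any) -> Any:
        prod_out = self.production(features)
        try:
            cand_out = self.candidate(features)
            logger.info("shadow input=%r prod=%r cand=%r match=%s",
                        features, prod_out, cand_out, prod_out == cand_out)
        except Exception:
            # A crash in the shadow model must never affect production traffic.
            logger.exception("candidate model failed on input=%r", features)
        return prod_out
```

The key design point is that the candidate's failures are swallowed: shadowing is observation only, so production callers always get the production answer.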
This can be generalised to comparing models on historic data, which greatly speeds up evaluation. It differs from cross-validation in that it is not about correctness, only about how different the new output is. It is like the pattern in UX development of a test harness that compares differences in a screenshot. If the differences look good, then ship it!
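The historic-data comparison above can be sketched as a single function; the name `disagreement_rate` and the callable-based model interface are assumptions for illustration. Note that, like a screenshot diff, it needs no labels: it measures only how often the two models differ, not which one is right.

```python
from typing import Any, Callable, Iterable

def disagreement_rate(production: Callable[[Any], Any],
                      candidate: Callable[[Any], Any],
                      historic_inputs: Iterable[Any]) -> float:
    """Fraction of historic inputs on which the two models disagree.

    No ground-truth labels are needed: we only ask how different the
    candidate's output is from production's, not whether it is correct.
    """
    total = 0
    diffs = 0
    for x in historic_inputs:
        total += 1
        if production(x) != candidate(x):
            diffs += 1
    return diffs / total if total else 0.0
```

In practice you would replay logged production inputs through both models and inspect the disagreeing cases by hand, shipping only if the differences look good.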
"Testing" in ML as it stands just now is essentially still development & application logic of the learned function, not testing in the same sense as we consider it in other aspects of software. The "post-train" area will need to a see lot of advances if we're to remain confident in our ML models in production (provided they continue to proliferate into more areas of software).