...because you need a control to evaluate how well your product is doing? I know it's a young field, but boy, do some folks love removing the "science" from "data science".
I'm not claiming that's what happened here, nor am I interested in nitpicking "what counts as 'science'". I'm just saying this is a reasonable thing to do.
Your control in an online environment is the current baseline. You don’t need to save the test set anymore; you can push the model online and test it directly.
If it performs about as well on instances it has never seen before (the test set), then it's not overfit to the test.
But not an expert or OP!
For a particular model, you at least try to do this by keeping separate test and validation sets, but on a meta-meta level, it's easy to see it happening.
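For anyone who hasn't done the split by hand, here is a rough sketch of what "separating a test and validation set" can look like. The function name `three_way_split`, the 15% fractions, and the fixed seed are all made up for illustration, not anything from the thread: you tune against the validation part and look at the test part only once at the end.

```python
import random


def three_way_split(items, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle items and split them into (train, validation, test) lists.

    The fractions and seed are illustrative defaults, not recommendations.
    """
    if val_frac < 0 or test_frac < 0 or val_frac + test_frac >= 1:
        raise ValueError("val_frac and test_frac must be >= 0 and sum to < 1")

    shuffled = list(items)
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)

    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)

    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Tune on `val`; evaluate on `test` once, at the very end.
train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))
```

The meta-level problem is that every time you look at the test score and go back to change something, the test set slowly turns into a second validation set.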
"corroborate", you find queries of the same level that a well-performing model would answer satisfactorily but a faulty, overfitted model would fail.
> Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.