My point is that being able to reliably reproduce copyrighted works will function as a very practical way to tell whether the dataset was corrupted with copyrighted material.
That way, it’ll be a lot easier to prove that a dataset was corrupted than to prove the negative.