I think that "whatever we do" is doing a lot of heavy lifting here. Some of those "whatevers" will be isomorphic to a frame-level analysis that pulls out structural commonalities, or close enough that it's not a clunky reductionist analogy.
I vaguely remember hearing that there are even ways to expand training data like that for neural networks (data augmentation, I think it's called), e.g. by presenting the same source image slightly rotated, partially obscured, etc.
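
For what it's worth, here is a rough sketch of what I mean by the rotated/obscured trick. It is only an illustration in plain Python on a grid of numbers, not how any particular library does it; the function names (`rotate`, `occlude`, `augment`) and the parameters (`max_degrees`, `patch`, `seed`) are all made up for the example.

```python
import math
import random

def rotate(image, degrees, fill=0):
    """Rotate a 2D grid about its centre using nearest-neighbour sampling."""
    h, w = len(image), len(image[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Map each output pixel back to the source pixel it came from.
            dx, dy = x - cx, y - cy
            sx = round(cos_t * dx + sin_t * dy + cx)
            sy = round(-sin_t * dx + cos_t * dy + cy)
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = image[sy][sx]
    return out

def occlude(image, top, left, size, fill=0):
    """Return a copy of the grid with a size x size square blanked out."""
    out = [row[:] for row in image]
    for y in range(top, min(top + size, len(out))):
        for x in range(left, min(left + size, len(out[0]))):
            out[y][x] = fill
    return out

def augment(image, label, copies=4, max_degrees=10, patch=2, seed=0):
    """Turn one labelled image into several slightly different ones."""
    rng = random.Random(seed)  # hypothetical fixed seed, for repeatability
    h, w = len(image), len(image[0])
    samples = [(image, label)]
    for _ in range(copies):
        variant = rotate(image, rng.uniform(-max_degrees, max_degrees))
        variant = occlude(variant, rng.randrange(h), rng.randrange(w), patch)
        samples.append((variant, label))  # the label stays the same
    return samples
```

The point is that every variant keeps the original label, so the network sees the same "thing" under several surface presentations and has to pick up on whatever stays constant across them.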