Yes, they sample the video every 16 frames and take the embedding of each sampled frame. Those embeddings become the checkpoints.
The AI is rewarded if, at each checkpoint, the state vector it has produced is sufficiently aligned with the video's.
I guess that's the initial training to deal with sparse rewards.
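
To make that concrete, here's a rough sketch of how I picture the checkpoint reward. Only the 16-frame spacing comes from the description above. Using cosine similarity as the "alignment" measure, the 0.5 threshold, the reward of 1.0, and all the function names are my own assumptions for illustration, not necessarily what the authors did.

```python
import math

CHECKPOINT_INTERVAL = 16  # frames between checkpoints, as described above
THRESHOLD = 0.5           # hypothetical cutoff for "sufficiently aligned"
REWARD = 1.0              # hypothetical reward for matching a checkpoint


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def make_checkpoints(video_embeddings, interval=CHECKPOINT_INTERVAL):
    """Keep the embedding of every `interval`-th frame of the video."""
    return video_embeddings[::interval]


def checkpoint_reward(agent_embedding, checkpoint, threshold=THRESHOLD):
    """Return REWARD if the agent's state embedding is close enough to the checkpoint."""
    if cosine_similarity(agent_embedding, checkpoint) > threshold:
        return REWARD
    return 0.0
```

With a 48-frame video, `make_checkpoints` keeps frames 0, 16 and 32, and the agent gets a reward at a checkpoint only when its own embedding points in roughly the same direction as the video's. That gives it a signal along the way even when the game itself almost never pays out.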