Heck, it is far simpler than video, because the point of view and the frame are fixed.
Further: "a diffusion model is trained to produce the next frame, conditioned on the sequence of past frames and actions." Note specifically the "and actions" part.
User input is being fed into this system and subsequent frames take that into account. The user is "actually" firing a gun.
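To make the "and actions" part concrete, here is a toy sketch of that kind of loop. Everything in it is made up for illustration: `toy_next_frame` is a hypothetical stand-in, not a diffusion model and not the paper's code, and a "frame" is just an integer. The only thing it shows is that each new frame is computed from past frames plus past actions, and is then fed back in as history.

```python
from collections import deque

def toy_next_frame(past_frames, past_actions):
    # Hypothetical stand-in for the trained model (an assumption, not the
    # paper's method): the "frame" is an int, and the latest action
    # changes how it evolves.
    step = 10 if past_actions[-1] == "fire" else 1
    return past_frames[-1] + step

def rollout(actions, context_len=4, first_frame=0):
    # Sliding windows of recent frames and actions act as the conditioning.
    frames = deque([first_frame], maxlen=context_len)
    acts = deque(maxlen=context_len)
    produced = []
    for action in actions:
        acts.append(action)                   # user input enters here
        nxt = toy_next_frame(list(frames), list(acts))
        frames.append(nxt)                    # output becomes future input
        produced.append(nxt)
    return produced

idle = rollout(["noop", "noop", "noop"])
shoot = rollout(["noop", "fire", "noop"])
print(idle, shoot)
```

Same starting frame, different inputs, different sequence of frames. A plain video generator has no `actions` argument at all.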
I highly suggest you at least skim the paper before commenting on the topic. The whole point is that it's not just generating a video.