Heck, it is far simpler than general video, because the point of view and frame are fixed.
Further, the paper says "a diffusion model is trained to produce the next frame, conditioned on the sequence of past frames and actions". Note specifically "and actions".
User input is fed into the system, and subsequent frames take it into account. The user is "actually" firing a gun.
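To make the conditioning concrete, here is a minimal sketch of that idea. This is not the paper's architecture; the tiny ConvNet, the tensor shapes, the action space size, and the crude noising step are all illustrative assumptions. The point it shows is just that the denoiser receives past frames AND the player's actions as inputs, so it learns p(next frame | past frames, actions) rather than unconditional video.

```python
# Toy sketch of action-conditioned next-frame diffusion training.
# All names, shapes, and the network are assumptions for illustration.
import torch
import torch.nn as nn

NUM_ACTIONS = 8      # assumed size of the discrete action space
CONTEXT = 4          # number of past frames used as conditioning
C, H, W = 3, 64, 64  # assumed frame shape

class ActionConditionedDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.action_emb = nn.Embedding(NUM_ACTIONS, 32)
        # Input: noisy next frame + CONTEXT past frames stacked on channels,
        # plus a spatially broadcast action embedding.
        self.net = nn.Sequential(
            nn.Conv2d(C * (CONTEXT + 1) + 32, 64, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, C, 3, padding=1),
        )

    def forward(self, noisy_next, past_frames, actions):
        b = noisy_next.shape[0]
        # past_frames: (B, CONTEXT, C, H, W) -> (B, CONTEXT*C, H, W)
        ctx = past_frames.reshape(b, CONTEXT * C, H, W)
        # Sum embeddings of the recent actions, broadcast over the image.
        act = self.action_emb(actions).sum(dim=1)        # (B, 32)
        act = act[:, :, None, None].expand(b, 32, H, W)
        x = torch.cat([noisy_next, ctx, act], dim=1)
        return self.net(x)  # predicted noise

# One training step: the target is the noise added to the REAL next frame,
# so the model is supervised to invert the noising given frames + actions.
model = ActionConditionedDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

past = torch.randn(2, CONTEXT, C, H, W)            # placeholder gameplay frames
actions = torch.randint(0, NUM_ACTIONS, (2, CONTEXT))
next_frame = torch.randn(2, C, H, W)
noise = torch.randn_like(next_frame)
noisy = next_frame + noise   # crude noising; real schedules scale both terms
pred = model(noisy, past, actions)
loss = nn.functional.mse_loss(pred, noise)
loss.backward()
opt.step()
```

At inference you would run the sampler one frame at a time, feeding back the generated frame and the player's latest input, which is why a different action (e.g. "fire") produces a different next frame.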
There cannot be "video compression artifacts" because the model hasn't even seen any compressed video during training, as far as I can tell.
Seriously, how is this even a discussion? The article is clear that the novel thing is real-time frame generation conditioned on the previous frame(s) AND player actions. Just generating video would be nothing new.
I highly suggest you at least skim the paper before commenting on the topic. The whole point is that it's not just generating a video.