Uh, maybe because monster death animations make up only a small part of the training material (i.e., gameplay footage), so the model hasn't learned to reproduce them very well?
There can't be "video compression artifacts" because, as far as I can tell, the model never saw any compressed video during training.
Seriously, how is this even a discussion? The article is clear that the novel thing is that this is real-time frame generation conditioned on the previous frame(s) AND player actions. Just generating video would be nothing new.
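For concreteness, the loop being described looks roughly like this. This is a toy sketch, not the actual model from the article: `predict_next_frame` is a hypothetical stand-in for the learned network, which in the real system would be a neural model denoising the next frame. The point is the control flow: each generated frame is conditioned on recent frames AND the player's action, then fed back in.

```python
import numpy as np

def predict_next_frame(prev_frames, action, rng):
    # Hypothetical stand-in for the trained model. The real system would
    # run a neural network here; we just blend the last frame with noise
    # and the action so the loop is runnable.
    last = prev_frames[-1]
    noise = rng.standard_normal(last.shape)
    return 0.9 * last + 0.1 * noise + 0.01 * action

def play(num_steps=4, context=3, h=8, w=8):
    rng = np.random.default_rng(0)
    # Seed with blank context frames; a real session would start from
    # actual rendered frames.
    frames = [np.zeros((h, w)) for _ in range(context)]
    for step in range(num_steps):
        action = float(step % 2)  # player input each tick (e.g. move/shoot)
        nxt = predict_next_frame(frames[-context:], action, rng)
        frames.append(nxt)  # the generated frame becomes future context
    return frames

frames = play()
print(len(frames))
```

That feedback of player actions into the generator, at interactive frame rates, is what makes it a playable simulation rather than a video clip.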