The training set comprised almost 1 billion frames, roughly 20 days of continuous play-time, covering virtually every inch of the map.
Now you show it N frames as input and ask it "give me frame N+1", and it gives you frame N+1 back, much as that frame originally appeared during training.
But that frame N+1 is not produced by some mysterious intelligence; it is simply recalled from the training dataset.
The drift you mentioned is actually clear (if disappointing) evidence that the model is not inventing new frames; it can only spit back answers from its training dataset.
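A toy sketch of why that drift happens in any autoregressive next-frame setup (the numbers and the one-dimensional "frames" here are purely illustrative, not the actual model): once the model's own slightly-off predictions are fed back in as input, the per-step error compounds over the rollout.

```python
# Toy autoregressive rollout: "frames" are single floats, and the
# "model" has learned the true dynamics slightly wrong.

def true_step(x):
    # Ground-truth dynamics: each frame advances by 1.0.
    return x + 1.0

def model_step(x):
    # Imperfect learned dynamics: consistently off by 0.05 per step.
    return x + 1.05

x_true = x_pred = 0.0
errors = []
for t in range(50):
    x_true = true_step(x_true)
    # The model never sees the ground truth again: its own output is
    # fed back in, so the small per-step error accumulates.
    x_pred = model_step(x_pred)
    errors.append(abs(x_pred - x_true))

print(errors[0], errors[-1])  # error grows as the rollout gets longer
```

Near the training distribution (early steps) the prediction looks fine; dozens of steps later the accumulated error has pushed the rollout far from anything the model saw during training, which is exactly when the output "goes wild".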
It's a bit like training Stable Diffusion on Simpsons episodes: at first it outputs the next frame of an existing episode from the training set, but a few frames later it goes wild and buggy.