I don’t know if that would really help; I have a hard time imagining exactly what that model would be doing in practice.
To be honest, none of the stuff in the paper is very practical. You almost certainly do not want a diffusion model trying to be an entire game under any circumstances.
What you might want to do is use a diffusion model to transform a low-poly, low-fidelity game world into something photorealistic. So the geometry, player movement, physics, etc. would all make sense, and then the model paints something that looks like reality over it, based on some primitive texture cues in the low-fidelity render.
I’d bet money that something like that will happen, and that it is the future of games and video.