It's not super clear from the landing page, but I think it's an engine? Like, its input is both the previous frames and the player's input for the next frame.
So as a player, if you press "shoot", the diffusion engine needs to output an image where the monster in front of you takes damage/dies.
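
Rough sketch of the loop I have in mind, in Python. To be clear, this is my guess at the shape of it, not how the project actually does it: `toy_next_frame`, `play`, the dict "frames", the context length and the damage value are all made up for illustration. A real model would produce pixels by denoising, and the monster's health would only exist implicitly in the image, not as a field.

```python
from collections import deque

def toy_next_frame(past_frames, action):
    """Hypothetical stand-in for the diffusion model.

    A real model would generate an image conditioned on the past frames
    and the action. Here a "frame" is just a dict so the loop is runnable.
    """
    last = past_frames[-1]
    frame = dict(last)
    frame["tick"] = last["tick"] + 1
    if action == "shoot":
        # Assumed rule: a shot takes 50 off the monster in front of you.
        frame["monster_hp"] = max(0, last["monster_hp"] - 50)
    return frame

def play(actions, context_len=4):
    # Only the last few frames are kept as conditioning (assumed window size).
    history = deque([{"tick": 0, "monster_hp": 100}], maxlen=context_len)
    for action in actions:
        frame = toy_next_frame(list(history), action)
        # The generated frame becomes part of the input for the next step.
        history.append(frame)
    return history[-1]

final = play(["noop", "shoot", "shoot"])
print(final)
```

The point is just that there's no separate game logic in the loop: whatever "state" exists has to be carried forward through the previous frames, and the action only matters through how it changes the next generated frame.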