I think you misunderstand me. I don't mean a snap evaluation and deciding between two very-short competing videos which is what the participants were doing. I mean doing an actual analysis of how well the simulation matches the ground truth of the game.
What I'd posit is that it's not actually a very good replication of the game but very good a replicating short clips that almost look like the game and the short time horizons are deliberately chosen because the authors know the model lacks coherence beyond that.