> pretty much every human would recognize the "before dragon" scene as being the same place as the "after dragon" scene
Not in a game, and those were enemies: it completely changed what they are and how many there are. People would notice such a massive change instantly if they looked away and there were suddenly 50% more enemies.
> The model clearly shows the ability to go beyond "image-to-image" rendering.
I never argued against that. Adding a third dimension (time) makes generating a video the same kind of problem as generating an image; it is not harder to draw a straight pencil line with something covering part of it than to draw a scene where something is covered for a while.
But even though it is that simple in principle, current models are really bad at it, because it requires very large models and a lot of compute. So I just extrapolated from their known current abilities, as you demonstrated there, to estimate roughly how long it will be until we can even get consistent short videos.
Note that video won't follow the same progression as images: the early image models were very small and we scaled them up quickly, whereas video already starts from heavily scaled-up models, so we have to wait for compute to get cheaper and faster the slow way.
> But also completely missed the point.
Either you completely missed my point or you changed your point afterwards. My point was that current models can only remember small bits under such circumstances, and to remember a whole scene they would need to be massively larger. Almost all details in the scene you showed were lost; the broad strokes are there, but to keep the details around you need an exponentially larger model.