Anyway, about my second question: why are the videos only about half a second long? Does the model unravel after that?
Also
> This is the first version of something that is now possible and will only improve with scale.
11B params is already pretty large considering the scale of Stable Diffusion and LLMs. How much further do we need to scale before we get something useful beyond simple setups?