The elephant in the room, of course, is "where did Sora's dataset come from?"
The position that this is a made-up issue is pretty bizarre when there are multiple large lawsuits pending over exactly this question.
Having open access to the training data is how you prevent poisoning and biasing of the dataset: people who flag bad data end up improving its quality. That's on top of the benefit of creators being credited in the dataset.
Hiding the data from public view seems to only help nefarious actors.
As in, making workable 3D models is harder than making video.
It's currently easier to get a 3D model by generating a video of the object and reconstructing the geometry from that than to generate the model directly.
Why is that? I don't know, but that's the current state of the industry: 3D model generation is simply harder.
> Now for the second point, both DiT and Sora replace the commonly-used U-Net architecture with a vanilla Transformer architecture. This matters because the authors of the DiT paper observe that using Transformers leads to predictable scaling: As you apply more training compute (either by training the model for longer or making the model larger, or both), you obtain better performance. The Sora technical report notes the same but for videos and includes a useful illustration.
> This scaling behavior, which can be quantified by so-called scaling laws, is an important property and it has been studied before in the context of Large Language Models (LLMs) and for autoregressive models on other modalities. The ability to apply scale to obtain better models was one of the key drivers behind the rapid progress on LLMs. Since the same property exists for image and video generation, we should expect the same scaling recipe to work here, too.
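For concreteness, here's a minimal PyTorch sketch of the first quoted point (a diffusion transformer: patchify the noised latent into tokens, run a plain Transformer over them instead of a U-Net, unpatchify to predict the noise). The class name, dimensions, and conditioning scheme are illustrative assumptions on my part, not Sora's or the DiT paper's actual code (real DiT also uses positional embeddings and adaLN conditioning, omitted here for brevity):

```python
# Toy DiT-style denoiser: tokens from patches + vanilla Transformer, no U-Net.
# All sizes and names are illustrative, not any real model's configuration.
import torch
import torch.nn as nn

class TinyDiT(nn.Module):
    def __init__(self, in_ch=4, patch=2, dim=256, depth=4, heads=4):
        super().__init__()
        self.embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)   # patchify latent
        self.t_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)                      # plain Transformer
        self.unembed = nn.ConvTranspose2d(dim, in_ch, kernel_size=patch, stride=patch)

    def forward(self, x, t):
        # x: noised latent (B, C, H, W); t: diffusion timestep (B,)
        tokens = self.embed(x)                                                 # (B, dim, H/p, W/p)
        b, d, h, w = tokens.shape
        tokens = tokens.flatten(2).transpose(1, 2)                             # (B, N, dim) sequence
        tokens = tokens + self.t_embed(t[:, None].float())[:, None, :]         # timestep conditioning
        tokens = self.blocks(tokens)
        tokens = tokens.transpose(1, 2).reshape(b, d, h, w)
        return self.unembed(tokens)                                            # predicted noise

model = TinyDiT()
out = model(torch.randn(2, 4, 32, 32), torch.randint(0, 1000, (2,)))
print(out.shape)  # torch.Size([2, 4, 32, 32])
```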
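And on the second quoted point, "scaling law" just means the loss falls as a power law in training compute, L(C) = a * C^(-b). The constants below are made up purely for illustration, not measurements from DiT, Sora, or any real model:

```python
# Illustrative power-law scaling: loss L(C) = a * C**(-b). Coefficients are invented.
import numpy as np

a, b = 5.0, 0.08                      # hypothetical coefficients
compute = np.logspace(18, 24, 7)      # training FLOPs from 1e18 to 1e24
loss = a * compute ** (-b)

for c, l in zip(compute, loss):
    print(f"compute {c:.0e} FLOPs -> predicted loss {l:.3f}")

# On a log-log plot this is a straight line: every 10x of compute multiplies
# the loss by 10**(-b) ~ 0.83, which is why "just scale it up" keeps paying off.
```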
If an H100 is $40k worst case, that's a one-time cost of $356M! I could definitely see the FAANGs throwing money at this.
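As a sanity check on that back-of-envelope figure, the GPU count isn't stated above, so it's recovered here from the two numbers given:

```python
# Back-of-envelope: implied H100 count from the $40k unit price and $356M total.
price_per_h100 = 40_000          # "$40k worst case" per H100
total_cost = 356_000_000         # "$356M" one-time cost quoted above
implied_gpus = total_cost / price_per_h100
print(f"implied H100 count: {implied_gpus:,.0f}")   # -> 8,900
```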
This is why Sama said compute is the currency of the future.