The elephant in the room, of course, is "where did Sora's dataset come from?"
The position that this is a made-up issue is pretty bizarre when there are multiple large lawsuits pending over exactly this question.
Having open access to the training data is how you prevent poisoning and biasing of the dataset: people who flag bad data end up improving its quality. That's on top of the benefit of creators being credited in the dataset.
Hiding the data from public view seems to only help nefarious actors.
As in, making workable 3D models is harder than making video.
It's currently easier to get a 3D model by generating a video of the object and reconstructing the geometry from that than to generate the model directly.
Why is that? I don't know, but that's the current state of the industry: 3D model generation is simply harder.
> Now for the second point, both DiT and Sora replace the commonly-used U-Net architecture with a vanilla Transformer architecture. This matters because the authors of the DiT paper observe that using Transformers leads to predictable scaling: As you apply more training compute (either by training the model for longer or making the model larger, or both), you obtain better performance. The Sora technical report notes the same but for videos and includes a useful illustration.
> This scaling behavior, which can be quantified by so-called scaling laws, is an important property and it has been studied before in the context of Large Language Models (LLMs) and for autoregressive models on other modalities. The ability to apply scale to obtain better models was one of the key drivers behind the rapid progress on LLMs. Since the same property exists for image and video generation, we should expect the same scaling recipe to work here, too.
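For concreteness, here's a minimal PyTorch sketch of the first quoted point (a diffusion transformer: patchify the noised latent into tokens, run a plain Transformer over them instead of a U-Net, unpatchify to predict the noise). The class name, dimensions, and conditioning scheme are illustrative assumptions on my part, not Sora's or the DiT paper's actual code (real DiT also uses positional embeddings and adaLN conditioning, omitted here for brevity):

```python
# Toy DiT-style denoiser: tokens from patches + vanilla Transformer, no U-Net.
# All sizes and names are illustrative, not any real model's configuration.
import torch
import torch.nn as nn

class TinyDiT(nn.Module):
    def __init__(self, in_ch=4, patch=2, dim=256, depth=4, heads=4):
        super().__init__()
        self.embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)   # patchify latent
        self.t_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)                      # plain Transformer
        self.unembed = nn.ConvTranspose2d(dim, in_ch, kernel_size=patch, stride=patch)

    def forward(self, x, t):
        # x: noised latent (B, C, H, W); t: diffusion timestep (B,)
        tokens = self.embed(x)                                                 # (B, dim, H/p, W/p)
        b, d, h, w = tokens.shape
        tokens = tokens.flatten(2).transpose(1, 2)                             # (B, N, dim) sequence
        tokens = tokens + self.t_embed(t[:, None].float())[:, None, :]         # timestep conditioning
        tokens = self.blocks(tokens)
        tokens = tokens.transpose(1, 2).reshape(b, d, h, w)
        return self.unembed(tokens)                                            # predicted noise

model = TinyDiT()
out = model(torch.randn(2, 4, 32, 32), torch.randint(0, 1000, (2,)))
print(out.shape)  # torch.Size([2, 4, 32, 32])
```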
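And on the second quoted point, "scaling law" just means the loss falls as a power law in training compute, L(C) = a * C^(-b). The constants below are made up purely for illustration, not measurements from DiT, Sora, or any real model:

```python
# Illustrative power-law scaling: loss L(C) = a * C**(-b). Coefficients are invented.
import numpy as np

a, b = 5.0, 0.08                      # hypothetical coefficients
compute = np.logspace(18, 24, 7)      # training FLOPs from 1e18 to 1e24
loss = a * compute ** (-b)

for c, l in zip(compute, loss):
    print(f"compute {c:.0e} FLOPs -> predicted loss {l:.3f}")

# On a log-log plot this is a straight line: every 10x of compute multiplies
# the loss by 10**(-b) ~ 0.83, which is why "just scale it up" keeps paying off.
```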
If an H100 is $40k worst case, that's a one-time cost of $356M! I could definitely see the FAANGs throwing money at this.
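As a sanity check on that back-of-envelope figure, the GPU count isn't stated above, so it's recovered here from the two numbers given:

```python
# Back-of-envelope: implied H100 count from the $40k unit price and $356M total.
price_per_h100 = 40_000          # "$40k worst case" per H100
total_cost = 356_000_000         # "$356M" one-time cost quoted above
implied_gpus = total_cost / price_per_h100
print(f"implied H100 count: {implied_gpus:,.0f}")   # -> 8,900
```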
This is why Sama said compute is the currency of the future.