Connect this to a robot that has a real time camera feed. Have it constantly generate potential future continuations of the feed that it's getting -- maybe more than one. You have an autonomous robot building a real time model of the world around it and predicting the future. Give it some error correction based on how well each prediction models the actual outcome and I think you're _really_ close to AGI.
You can probably already imagine different ways to wire the output to text generation and controlling its own motions, etc, and predicting outcomes based on actions it, itself could plausibly take, and choosing the best one.
It doesn't actually have to generate realistic imagery or imagery that doesn't have any mistakes or imagery that's high definition to be used in that way. How realistic is our own imagination of the world?
Edit: I'm going to add a specific case. Imagine a house-cleaning robot. It starts with an image of your living room. Then it creates an image of your living room after it's been cleaned. Then it interpolates a video _imagining itself cleaning the room_, then acts as much as it can to mimic what's in the video, then generates a new continuation, then acts, and so on. Imagine doing that several times a second, if necessary.
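Roughly, the loop I have in mind looks like this (a pure sketch -- `camera`, `video_model`, and `robot` are made-up placeholders, not real APIs):

    # Hypothetical sketch of the imagine-then-act loop described above.
    # None of these objects are real APIs; they stand in for a camera feed,
    # a generative video model, and a low-level robot controller.
    def cleaning_loop(camera, video_model, robot, goal_prompt="this room, fully cleaned"):
        while True:
            now = camera.frame()                                # current view of the room
            goal = video_model.imagine_image(now, goal_prompt)  # imagined "after" picture
            plan = video_model.interpolate(now, goal)           # imagined video of the cleanup
            robot.mimic(plan.first_moments())                   # act out only the start of the plan
            # then re-observe and re-imagine -- several times a second, if necessary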
Check out V-Jepa for such a system: https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-jo...
In theory, yes. The problem is we've had AGI many times before, in theory. For example, Q-learning: feed the state of any game or system through a neural network, have it predict possible future rewards, iteratively improve the accuracy of the reward predictions, and boom, eventually you arrive at the optimal behavior for any system. We've known this since... the 70's maybe? I don't know how far Q-learning goes back.
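For reference, the tabular version of that idea fits in a few lines (just a sketch; `env` here is a hypothetical gym-style environment exposing `reset()`, `step()`, and a list of `actions`):

    import random
    from collections import defaultdict

    # Minimal tabular Q-learning: predict future reward per (state, action),
    # then nudge each prediction toward observed-reward-plus-best-next-prediction.
    def q_learning(env, episodes=1000, alpha=0.1, gamma=0.99, eps=0.1):
        Q = defaultdict(float)                      # Q[(state, action)] -> predicted future reward
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                if random.random() < eps:           # explore occasionally
                    a = random.choice(env.actions)
                else:                               # otherwise act greedily on current predictions
                    a = max(env.actions, key=lambda a: Q[(s, a)])
                s2, r, done = env.step(a)
                best_next = 0 if done else max(Q[(s2, a2)] for a2 in env.actions)
                Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])  # error correction
                s = s2
        return Q

(The neural-network version just swaps the table for a function approximator, which is usually where the trouble starts.)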
I like to do experiments with reinforcement learning and it's always exciting to think "once I turn this thing on, it's going to work well and find lots of neat solutions to the problem", and the thing is, it's true, that might happen, but usually it doesn't. Usually I see some signs of learning, but it fails to come up with anything spectacular.
I keep watching for a strong AI in a video game like Civilization as a sign that AI can solve problems in a highly complex system while also being practical enough that game creators can actually ship it. Yes, maybe a team of experts could solve Civilization as a research project, but that's far from being practical. Do you think we'll be able to show an AI a video of people playing Civilization and have the model predict the best moves before the AI in the game is able to?
It might not be too crazy of an idea - would love to see a model fine-tuned on sequences of moves.
The biggest limitation of video game AI currently is not theory, but hardware. Once home compute doubles a few more times, we’ll all be running GPT-4 locally and a competent Civilization AI starts to look realistic.
And even if it succeeds, it fails again as soon as you change the environment because RL doesn't generalise. At all. It's kind of shocking to be honest.
https://robertkirk.github.io/2022/01/17/generalisation-in-re...
Projecting into the future in 3D world space is actually the endgame for robotics, and I imagine that, depending on how complex that 3D world model is, a working model for projecting into 3D space could be waaaaaay smaller.
It's just that the equivalent data is not as easily available on the internet :)
Not sure how people are concluding that realistic physics is feasible operating solely in pixel space, because it obviously isn't, and anyone with any experience training such models would instantly recognize the local optimum these demos represent. The point of inductive bias is to make the loss function as convex as possible by inducing a parametrization that is "natural" to the system being modeled. Physics is exactly the attempt to formalize such a model, borne of human cognitive faculties, and it's hard to imagine that you can do better with less fidelity by just throwing more parameters and data at the problem, especially when the parametrization is so incongruent with the inherent dynamics at play.
As another comment points out, that's Yann LeCun's idea of "Objective-Driven AI" introduced in [1], though not named that in the paper (LeCun has named it that way in talks and slides). LeCun has also said that this won't be achieved with generative models. So: either 1 out of 2 right, or both wrong, one way or another.
For me, I've been in AI long enough to remember many such breakthroughs that were going to lead to AGI before - from DeepBlue (actually) to CNNs, to Deep RL, to LLMs just now, etc. Either all those were not the breakthroughs people thought at the time, or it takes a lot more than one engineering breakthrough to get to AGI; otherwise it's hard to explain why the field keeps losing its mind about the Next Big Thing and then forgetting about it a few years later, when the Next Next Big Thing comes around.
But, enough with my cynicism. You think that idea can work? Try it out, in a simplified environment. Take some stupid grid world, a simplification of a text-based game like Nethack [2], and try to implement your idea in vitro, as it were. See how well it works (a rough sketch of what I mean is below the footnotes). You could write a paper about it.
____________________
[1] https://openreview.net/pdf?id=BZ5a1r-kVsf
[2] Obviously don't start with Nethack itself because that's damn hard for "AI".
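A minimal in-vitro version of the "predict the future, correct on error" idea might look something like this (toy grid world, every name made up, no claim that it scales):

    import random

    # Toy world model: a table of predicted next positions on a 5x5 grid,
    # corrected whenever a prediction turns out to be wrong.
    SIZE, GOAL = 5, (4, 4)
    ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

    def step(pos, action):  # ground-truth dynamics, hidden from the agent
        dx, dy = ACTIONS[action]
        return (min(max(pos[0] + dx, 0), SIZE - 1), min(max(pos[1] + dy, 0), SIZE - 1))

    def distance_to_goal(pos):
        return abs(pos[0] - GOAL[0]) + abs(pos[1] - GOAL[1])

    model = {}  # learned predictions: (pos, action) -> predicted next pos

    def predict(pos, action):
        return model.get((pos, action), pos)  # until corrected, predict "nothing moves"

    pos = (0, 0)
    for t in range(500):
        if random.random() < 0.2:  # some exploration, or nothing new is ever discovered
            action = random.choice(list(ACTIONS))
        else:  # imagine one step per action, pick whichever is predicted to land closest to the goal
            action = min(ACTIONS, key=lambda a: distance_to_goal(predict(pos, a)))
        new_pos = step(pos, action)
        if predict(pos, action) != new_pos:  # error correction: fix the model where it was wrong
            model[(pos, action)] = new_pos
        pos = new_pos
        if pos == GOAL:
            print(f"reached the goal at step {t}")
            break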
future successor to Sora + likely successor to GPT-4 = ASI
See my other comment here: https://news.ycombinator.com/item?id=39391971
A key element of anything that can be classified as "general intelligence" is developing internally consistent and self-contained agency, and then being able to act on it. Today we have absolutely no idea how to do this in AI. Even the tiniest of worms and insects demonstrate capabilities several orders of magnitude beyond what our largest AIs can do.
We are about as close to AGI as James Watt was to nuclear fusion.
So right now, Sora is predicting "Hollywood-style" content, with cuts, camera motions, etc., all much like what you'd expect to see in an edited film.
Nothing stops someone (including OpenAI) from training the same architecture with "real world captures".
Imagine telling a bunch of warehouse workers that for "safety" they all need to wear a GoPro-like action camera on their helmets that records everything inside the work area. Run that in a bunch of warehouses with varying sizes, content, and forklifts, and then pump all of that through this architecture to train it. Include the instructions given to the staff from the ERP system, as well as the transcribed audio, as the text prompt.
Ta-da.
You have yourself an AI that can control a robot using the same action camera as its vision input. It will be able to follow instructions from the ERP, listen to spoken instructions, and even respond with a natural voice. It'll even be able to handle scenarios such as spills, breaks, or other accidents... just like the humans in its training data did. This is basically what vehicle auto-pilots do, but on steroids.
Sure, the computer power required for this is outrageously expensive right now, but give it ten to twenty years and... no more manual labour.
A Pacman bot is not AGI. It might eat all the dots correctly, whereas before, if something scrolled off the screen, it'd forget about it and glitch out - but you haven't fanned any flames of consciousness into existence just yet.
Imagine where you want to be (eg, “I scored a goal!”) from where you are now, visualize how you’ll get there (eg, a trick and then a shot), then do that.
It'll be much harder for more open-ended world problems, where the physics encountered may be rare enough in the dataset that the simulation breaks unexpectedly. For example, a glass smashing on the floor: the model doesn't simulate that causally, afaik.
There was that article a few months ago about how basically that's what the cerebellum does.
(and throw this in for good measure https://www.wired.com/story/this-lab-grown-skin-could-revolu... heh)
Staring at a painting in a museum,
then immediately jumping into an entire VR world based on the painting, generated on the fly by an AI.
For example, the surfer is surfing in the air at the end:
https://cdn.openai.com/tmp/s/prompting_7.mp4
Or this "breaking" glass that does not break, but spills liquid in some weird way:
https://cdn.openai.com/tmp/s/discussion_0.mp4
Or the way this person walks:
https://cdn.openai.com/tmp/s/a-woman-wearing-a-green-dress-a...
Or wherever this map is coming from:
https://cdn.openai.com/tmp/s/a-woman-wearing-purple-overalls...
> https://cdn.openai.com/tmp/s/a-woman-wearing-purple-overalls...
Notice also that at roughly 6 seconds there is a third hand putting the map away.
Maybe it’s been watching snowboarding videos and doesn’t quite understand the difference.
> https://cdn.openai.com/tmp/s/a-woman-wearing-a-green-dress-a...
Also, why does she have an umbrella sticking out from her lower back?
https://cdn.openai.com/tmp/s/an-adorable-kangaroo-wearing-bl...
https://cdn.openai.com/tmp/s/an-adorable-kangaroo-wearing-a-...
So this is why they haven't shown Will Smith eating spaghetti.
> These capabilities suggest that continued scaling of video models is a promising path towards the development of highly-capable simulators of the physical and digital world
This is exciting for robotics. But an even closer application would be filling holes in gaussian splatting scenes. If you want to make a 3D walkthrough of a space you need to take hundreds to thousands of photos with seamless coverage of every possible angle, and you're still guaranteed to miss some. Seems like a model this capable could easily produce plausible reconstructions of hidden corners or close up detail or other things that would just be holes or blurry parts in a standard reconstruction. You might only need five or ten regular photos of a place to get a completely seamless and realistic 3D scene that you could explore from any angle. You could also do things like subtract people or other unwanted objects from the scene. Such an extrapolated reconstruction might not be completely faithful to reality in every detail, but I think this could enable lots of applications regardless.
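One (very hand-wavy) way that hole-filling could slot in; every function name below is a made-up placeholder, and the video-model call in particular doesn't exist as a real API today:

    # Hypothetical pipeline: use a generative video model to hallucinate plausible
    # views for the holes in a Gaussian splatting reconstruction. All placeholders.
    def reconstruct_with_filled_holes(photos, video_model):
        splats = fit_gaussian_splats(photos)                   # standard reconstruction from a few photos
        for view in sample_novel_viewpoints(splats):
            render, confidence = render_with_confidence(splats, view)
            if confidence.min() < 0.5:                         # a hole, hidden corner, or blurry patch
                plausible = video_model.extend_scene(render, camera_pose=view)
                photos.append(plausible)                       # treat the generated view as extra "evidence"
        return fit_gaussian_splats(photos)                     # refit using real + generated views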
“Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.”
General, superhuman robotic capabilities on the software side can be achieved once such a simulator is good enough. (Whether that can be achieved with this approach is still not certain.) Why superhuman? Larger context length than our working memory is an obvious one, but there will likely be other advantages, such as using alternative sensory modalities and more granular simulation of details unfamiliar to most humans.
It's really not that surprising since, to be honest, meshes suck.
They're pretty general graphs but to actually work nicely they have to have really specific topological characteristics. Half of the work you do with meshes is repeatedly coaxing them back into a sane topology after editing them.
The results really are impressive. Being able to generate such high quality videos, and to extend videos into the past and the future, shows how much the model "understands" the real world: object interactions, 3D composition, etc.
Although image generation already requires the model to know a lot about the world, I think there's really a huge gap with video generation, where the model needs to "know" 3D, object movements, and interactions.
I can't wait to play with this, but I can't even imagine how expensive it must be. They're training at full resolution and can generate up to a minute of video.
Seeing how bad video generation was, I expected it would take a few more years to get to this, but it seems like this is another case of "Add data & compute"(TM), where transformers prove once again they'll learn everything and be great at it.
The robot examples are very underwhelming, but the people and background people are all very well done, and at a level much better than most static image diffusion models produce. Generating the same people as they interact with objects is also not something I expected a model like this to do well so soon.
The crazy thing is that they do it by prompting the model to inpaint a chrome sphere into the middle of the image to reflect what is behind the camera! The model can interpret the context and dream up what is plausibly in the whole environment.
It is impressive at a glance, but if you pay attention, it is more like dreaming than realism (images conjured out of other images, without attention to long-term temporal, spatial, and causal consistency). I'm hardly more impressed than by Google's Deep Dream, which is 10 years old.
Surely it's not perfect, but this was not the case for previous video generation algorithms
I know it's a delicate topic people (especially in the US) don't want to speak about at all, but damn, this is a giant market and could do humanity good if done well.
Producing a neverending supply of wireheading-like addictive stimuli is the farthest possible thing from a good for humanity.
Want to do good in this area - work on ways to limit consumption.
Sora: Creating video from text - https://news.ycombinator.com/item?id=39386156 - Feb 2024 (1430 comments)
I could imagine a text adventure game with a 'visual' component perhaps working if you got the model to maintain enough consistency in spaces and character appearances.
Plus, it could generate it in real time and take my responses into account. I look bored? Spice it up, etc.
Today such a thing seems closer than I thought.
Presumably, this would work for non-Minecraft applications, but Minecraft has a really standardized interface layer.
So this tech will increase interest in existing 3d tools in the short term
Edit, changed the links to the direct ones!
These videos honestly give me less confidence in the approach, simply because I don't know that the model will be able to distinguish these world model parameters as "unchanging".
People didn't think this was possible even a year ago.
"Gives me less confidence"
Cmon man...
example video links from TFA:
Every cv preprocessing pipeline is in shambles now.
Still, god damn.
Everything also has a sort of mushy feel about it, like nothing is anchored down and everything is swimming. Impressive all the same, but maybe these methods need to be paired with some old-fashioned 3D rendering to serve as a physically based guideline.
Reasoning, logic, formal systems, and physics exist in a seemingly completely different, mathematical space than pure video.
This is just a contrived, interesting viewpoint of the technology, right?
That's not true, AI systems in general have pretty strong mathematical proofs going back decades on what they can theoretically do, the problem is compute and general feasibility. AIXItl in theory would be able to learn reasoning, logic, formal systems, physics, human emotions, and a great deal of everything else just from watching videos. They would have to be videos of varied and useful things, but even if they were not, you'd at least get basic reasoning, logic, and physics.
> Other interactions, like eating food, do not always yield correct changes in object state
Can this be because we just don't shoot a lot of people eating? I think it is general advice to not show people eating on camera for various reasons. I wonder if we know if that kind of topic bias exists in the dataset.
Is this generating videos as streaming content, e.g. like an mp4 video? As far as I can see, it is doing that. Is it possible for AI to actually produce the 3D models?
What kind of compute resources are required to produce the 3D models?
https://twitter.com/BenMildenhall/status/1758224827788468722
https://twitter.com/ScottieFoxTTV/status/1758272455603327455
The key is that the video has spatial consistency. Once you've got that, then other existing tech can take the output and infer actual spatial forms.
1. High quality video or image from text
2. Taking in any content as input and generating forwards/backwards in time
3. Style transformation
4. Digital world simulation!
so they're gonna include the never-before-observed-but-predicted Unruh effect, as well? and other quantum theory? cool..
> For example, it does not accurately model the physics of many basic interactions, like glass shattering.
... oh
Isn't all of the training predicated on visible, gathered data, rather than theory? If so, I don't think it's right to call these things simulators of the physical world if they don't include physical theory. DFT at least has some roots in theory.
The simple counter argument is that they're claiming that there is potential for general purpose physical simulation. So please don't play with words. I thought we were supposed to be talking about something concrete instead of whatever you wanted to twist things into to suit your purpose
either that, or they're not qualified to distinguish between the two options you presented, which is obviously not true
right?
the article's bazillion references have nothing to do with physics
nor is there any substance behind the claim of physical simulation
what is actually being simulated is synthesis of any video
it's not that complicated and you and they must know it too
it's double speak easily seen from miles away
scammers have to downvote
Some of the reasons the less optimistic scenarios seem likely are that the kinds of extrapolation errors this model makes are similar in character to those of LLMs: extrapolation follows a gradient of smooth apparent transitions rather than some underlying logic about the objects portrayed, and it sometimes seems to just sort of ignore situations that are far enough outside of what it's seen rather than reconcile them. For example, the tidal wave/historical hall example is a scenario unlikely to have been in the training data. Sure, there's the funny bit at the end where the surfer appears to levitate in the air, but there's a much larger issue with how these two contrasting scenes interact, or rather fail to. What we see looks a lot more like a scene of surfing superimposed via Photoshop or something on a still image of the hall, as there's no evidence of the water interacting with the seats or walls of the hall at all. The model will just roll with whatever you tell it to do as best it can, but it's not doing something like modeling "what would happen if" that implausible scenario played out, and even doing that poorly would be a better sign of it "simulating" the described scenario. Instead, we have impressive results for prompts that likely correspond strongly to scenes the model has seen, and evidence of a lack of composition in cases where a particular composition is unlikely to have been seen and needs some underlying understanding of how it "would" work, a gap that is visible to us.
Also, the concept of learning to simulate the world seems more important than just the media and content implications.
Also - ironic choice of username considering this comment!