Connect this to a robot that has a real time camera feed. Have it constantly generate potential future continuations of the feed that it's getting -- maybe more than one. You have an autonomous robot building a real time model of the world around it and predicting the future. Give it some error correction based on how well each prediction models the actual outcome and I think you're _really_ close to AGI.
You can probably already imagine different ways to wire the output to text generation and controlling its own motions, etc, and predicting outcomes based on actions it, itself could plausibly take, and choosing the best one.
It doesn't actually have to generate realistic imagery or imagery that doesn't have any mistakes or imagery that's high definition to be used in that way. How realistic is our own imagination of the world?
Edit: I'm going to add a specific case. Imagine a house-cleaning robot. It starts with an image of your living room. Then it creates an image of your living room after it's been cleaned. Then it interpolates a video _imagining itself cleaning the room_, then acts as much as it can to mimic what's in the video, then generates a new continuation, then acts, and so on. Imagine doing that several times a second, if necessary.
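Roughly, the loop I have in mind looks like this (a pure sketch -- `camera`, `video_model`, and `robot` are made-up placeholders, not real APIs):

    # Hypothetical sketch of the imagine-then-act loop described above.
    # None of these objects are real APIs; they stand in for a camera feed,
    # a generative video model, and a low-level robot controller.
    def cleaning_loop(camera, video_model, robot, goal_prompt="this room, fully cleaned"):
        while True:
            now = camera.frame()                                # current view of the room
            goal = video_model.imagine_image(now, goal_prompt)  # imagined "after" picture
            plan = video_model.interpolate(now, goal)           # imagined video of the cleanup
            robot.mimic(plan.first_moments())                   # act out only the start of the plan
            # then re-observe and re-imagine -- several times a second, if necessary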
Check out V-Jepa for such a system: https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-jo...
In theory, yes. The problem is we've had AGI many times before, in theory. For example, Q-learning: feed the state of any game or system through a neural network, have it predict possible future rewards, iteratively improve the accuracy of the reward predictions, and boom, eventually you arrive at the optimal behavior for any system. We've known this since... the 70's maybe? I don't know how far Q-learning goes back.
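For reference, the tabular version of that idea fits in a few lines (just a sketch; `env` here is a hypothetical gym-style environment exposing `reset()`, `step()`, and a list of `actions`):

    import random
    from collections import defaultdict

    # Minimal tabular Q-learning: predict future reward per (state, action),
    # then nudge each prediction toward observed-reward-plus-best-next-prediction.
    def q_learning(env, episodes=1000, alpha=0.1, gamma=0.99, eps=0.1):
        Q = defaultdict(float)                      # Q[(state, action)] -> predicted future reward
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                if random.random() < eps:           # explore occasionally
                    a = random.choice(env.actions)
                else:                               # otherwise act greedily on current predictions
                    a = max(env.actions, key=lambda a: Q[(s, a)])
                s2, r, done = env.step(a)
                best_next = 0 if done else max(Q[(s2, a2)] for a2 in env.actions)
                Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])  # error correction
                s = s2
        return Q

(The neural-network version just swaps the table for a function approximator, which is usually where the trouble starts.)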
I like to do experiments with reinforcement learning and it's always exciting to think "once I turn this thing on, it's going to work well and find lots of neat solutions to the problem", and the thing is, it's true, that might happen, but usually it doesn't. Usually I see some signs of learning, but it fails to come up with anything spectacular.
I keep watching for a strong AI in a video game like Civilization as a sign that AI can solve problems in a highly complex system while also being practical enough that game creators can actually ship it. Yes, maybe a team of experts could solve Civilization as a research project, but that's far from being practical. Do you think we'll be able to show an AI a video of people playing Civilization and have the model predict the best moves before the AI in the game is able to?
It might not be too crazy of an idea - would love to see a model fine-tuned on sequences of moves.
The biggest limitation of video game AI currently is not theory, but hardware. Once home compute doubles a few more times, we’ll all be running GPT-4 locally and a competent Civilization AI starts to look realistic.
And even if it succeeds, it fails again as soon as you change the environment because RL doesn't generalise. At all. It's kind of shocking to be honest.
https://robertkirk.github.io/2022/01/17/generalisation-in-re...
Projecting into the future in 3D world space is actually the endgame for robotics, and I imagine that, depending on how complex that 3D world model is, a working model for projecting into 3D space could be waaaaaay smaller.
It's just that the equivalent data is not as easily available on the internet :)
Not sure how people are concluding that realistic physics is feasible operating solely in pixel space, because it obviously isn't, and anyone with any experience training such models would instantly recognize the local optimum these demos represent. The point of inductive bias is to make the loss function as convex as possible by inducing a parametrization that is "natural" to the system being modeled. Physics is exactly the attempt to formalize such a model, borne of human cognitive faculties, and it's hard to imagine that you can do better with less fidelity by just throwing more parameters and data at the problem, especially when the parametrization is so incongruent with the inherent dynamics at play.
As another comment points out, that's Yann LeCun's idea of "Objective-Driven AI" introduced in [1], though not named that in the paper (LeCun has named it that way in talks and slides). LeCun has also said that this won't be achieved with generative models. So: either 1 out of 2 right, or both wrong, one way or another.
For me, I've been in AI long enough to remember many such breakthroughs that were going to lead to AGI before - from DeepBlue (actually) to CNNs, to Deep RL, to LLMs just now, etc. Either all those were not the breakthroughs people thought at the time, or it takes a lot more than one engineering breakthrough to get to AGI; otherwise it's hard to explain why the field keeps losing its mind about the Next Big Thing and then forgetting about it a few years later, when the Next Next Big Thing comes around.
But, enough with my cynicism. You think that idea can work? Try it out, in a simplified environment. Take some stupid grid world, a simplification of a text-based game like Nethack [2], and try to implement your idea in vitro, as it were. See how well it works (a rough sketch of what I mean is below the footnotes). You could write a paper about it.
____________________
[1] https://openreview.net/pdf?id=BZ5a1r-kVsf
[2] Obviously don't start with Nethack itself because that's damn hard for "AI".
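A minimal in-vitro version of the "predict the future, correct on error" idea might look something like this (toy grid world, every name made up, no claim that it scales):

    import random

    # Toy world model: a table of predicted next positions on a 5x5 grid,
    # corrected whenever a prediction turns out to be wrong.
    SIZE, GOAL = 5, (4, 4)
    ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

    def step(pos, action):  # ground-truth dynamics, hidden from the agent
        dx, dy = ACTIONS[action]
        return (min(max(pos[0] + dx, 0), SIZE - 1), min(max(pos[1] + dy, 0), SIZE - 1))

    def distance_to_goal(pos):
        return abs(pos[0] - GOAL[0]) + abs(pos[1] - GOAL[1])

    model = {}  # learned predictions: (pos, action) -> predicted next pos

    def predict(pos, action):
        return model.get((pos, action), pos)  # until corrected, predict "nothing moves"

    pos = (0, 0)
    for t in range(500):
        if random.random() < 0.2:  # some exploration, or nothing new is ever discovered
            action = random.choice(list(ACTIONS))
        else:  # imagine one step per action, pick whichever is predicted to land closest to the goal
            action = min(ACTIONS, key=lambda a: distance_to_goal(predict(pos, a)))
        new_pos = step(pos, action)
        if predict(pos, action) != new_pos:  # error correction: fix the model where it was wrong
            model[(pos, action)] = new_pos
        pos = new_pos
        if pos == GOAL:
            print(f"reached the goal at step {t}")
            break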
future successor to Sora + likely successor to GPT-4 = ASI
See my other comment here: https://news.ycombinator.com/item?id=39391971
A key element of anything that can be classified as "general intelligence" is developing internally consistent and self-contained agency, and then being able to act on it. Today we have absolutely no idea how to do this in AI. Even the tiniest of worms and insects demonstrate capabilities several orders of magnitude beyond what our largest AIs can do.
We are about as close to AGI as James Watt was to nuclear fusion.
So right now, Sora is predicting "Hollywood-style" content, with cuts, camera motions, etc., all much like what you'd expect to see in an edited film.
Nothing stops someone (including OpenAI) from training the same architecture with "real world captures".
Imagine telling a bunch of warehouse workers that for "safety" they all need to wear a GoPro-like action camera on their helmets that records everything inside the work area. Run that in a bunch of warehouses with varying sizes, content, and forklifts, and then pump all of that through this architecture to train it. Include the instructions given to the staff from the ERP system, as well as the transcribed audio, as the text prompt.
Ta-da.
You have yourself an AI that can control a robot using the same action camera as its vision input. It will be able to follow instructions from the ERP, listen to spoken instructions, and even respond with a natural voice. It'll even be able to handle scenarios such as spills, breaks, or other accidents... just like the humans in its training data did. This is basically what vehicle auto-pilots do, but on steroids.
Sure, the computer power required for this is outrageously expensive right now, but give it ten to twenty years and... no more manual labour.
A Pacman bot is not AGI. It might eat all the dots correctly, whereas before, if something scrolled off the screen, it'd forget about it and glitch out - but you haven't fanned any flames of consciousness into existence just yet.
Imagine where you want to be (eg, “I scored a goal!”) from where you are now, visualize how you’ll get there (eg, a trick and then a shot), then do that.
It'll be much harder for more open-ended world problems, where the physics encountered may be rare enough in the dataset that the simulation breaks unexpectedly. For example, a glass smashing on the floor: the model doesn't simulate that causally, afaik.
There was that article a few months ago about how basically that's what the cerebellum does.
(and throw this in for good measure https://www.wired.com/story/this-lab-grown-skin-could-revolu... heh)
Staring at a painting in a museum,
then immediately jumping into an entire VR world based on the painting, generated on the fly by an AI.
For example, the surfer is surfing in the air at the end:
https://cdn.openai.com/tmp/s/prompting_7.mp4
Or this "breaking" glass that does not break, but spills liquid in some weird way:
https://cdn.openai.com/tmp/s/discussion_0.mp4
Or the way this person walks:
https://cdn.openai.com/tmp/s/a-woman-wearing-a-green-dress-a...
Or wherever this map is coming from:
https://cdn.openai.com/tmp/s/a-woman-wearing-purple-overalls...
> https://cdn.openai.com/tmp/s/a-woman-wearing-purple-overalls...
Notice also that at roughly 6 seconds there is a third hand putting the map away.
Maybe it’s been watching snowboarding videos and doesn’t quite understand the difference.
> https://cdn.openai.com/tmp/s/a-woman-wearing-a-green-dress-a...
Also, why does she have an umbrella sticking out from her lower back?
https://cdn.openai.com/tmp/s/an-adorable-kangaroo-wearing-bl...
https://cdn.openai.com/tmp/s/an-adorable-kangaroo-wearing-a-...
So this is why they haven't shown Will Smith eating spaghetti.
> These capabilities suggest that continued scaling of video models is a promising path towards the development of highly-capable simulators of the physical and digital world
This is exciting for robotics. But an even closer application would be filling holes in gaussian splatting scenes. If you want to make a 3D walkthrough of a space you need to take hundreds to thousands of photos with seamless coverage of every possible angle, and you're still guaranteed to miss some. Seems like a model this capable could easily produce plausible reconstructions of hidden corners or close up detail or other things that would just be holes or blurry parts in a standard reconstruction. You might only need five or ten regular photos of a place to get a completely seamless and realistic 3D scene that you could explore from any angle. You could also do things like subtract people or other unwanted objects from the scene. Such an extrapolated reconstruction might not be completely faithful to reality in every detail, but I think this could enable lots of applications regardless.
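One (very hand-wavy) way that hole-filling could slot in; every function name below is a made-up placeholder, and the video-model call in particular doesn't exist as a real API today:

    # Hypothetical pipeline: use a generative video model to hallucinate plausible
    # views for the holes in a Gaussian splatting reconstruction. All placeholders.
    def reconstruct_with_filled_holes(photos, video_model):
        splats = fit_gaussian_splats(photos)                   # standard reconstruction from a few photos
        for view in sample_novel_viewpoints(splats):
            render, confidence = render_with_confidence(splats, view)
            if confidence.min() < 0.5:                         # a hole, hidden corner, or blurry patch
                plausible = video_model.extend_scene(render, camera_pose=view)
                photos.append(plausible)                       # treat the generated view as extra "evidence"
        return fit_gaussian_splats(photos)                     # refit using real + generated views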
“Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.”
General, superhuman robotic capabilities on the software side can be achieved once such a simulator is good enough. (Whether that can be achieved with this approach is still not certain.) Why superhuman? Larger context length than our working memory is an obvious one, but there will likely be other advantages, such as using alternative sensory modalities and more granular simulation of details unfamiliar to most humans.
It's really not that surprising since, to be honest, meshes suck.
They're pretty general graphs but to actually work nicely they have to have really specific topological characteristics. Half of the work you do with meshes is repeatedly coaxing them back into a sane topology after editing them.
The results really are impressive. Being able to generate such high quality videos, and to extend videos into the past and the future, shows how much the model "understands" the real world: object interactions, 3D composition, etc.
Although image generation already requires the model to know a lot about the world, I think there's really a huge gap with video generation, where the model needs to "know" 3D, object movements, and interactions.
I can't wait to play with this, but I can't even imagine how expensive it must be. They're training at full resolution and can generate up to a minute of video.
Seeing how bad video generation was, I expected it would take a few more years to get to this, but it seems like this is another case of "Add data & compute"(TM), where transformers prove once again they'll learn everything and be great at it.
The robot examples are very underwhelming, but the people and background people are all very well done, and at a level much better than most static image diffusion models produce. Generating the same people as they interact with objects is also not something I expected a model like this to do well so soon.
The crazy thing is that they do it by prompting the model to inpaint a chrome sphere into the middle of the image to reflect what is behind the camera! The model can interpret the context and dream up what is plausibly in the whole environment.
It is impressive at a glance, but if you pay attention, it is more like dreaming than realism (images conjured out of other images, without attention to long-term temporal, spatial, and causal consistency). I'm hardly more impressed than by Google's Deep Dream, which is 10 years old.
Surely it's not perfect, but this was not the case for previous video generation algorithms
I know it's a delicate topic people (especially in the US) don't want to speak about at all, but damn, this is a giant market and could do humanity good if done well.
Producing a neverending supply of wireheading-like addictive stimuli is the farthest possible thing from a good for humanity.
Want to do good in this area - work on ways to limit consumption.
Sora: Creating video from text - https://news.ycombinator.com/item?id=39386156 - Feb 2024 (1430 comments)
I could imagine a text adventure game with a 'visual' component perhaps working if you got the model to maintain enough consistency in spaces and character appearances.
Plus, it could generate it in real time and take my responses into account. I look bored? Spice it up, etc.
Today such a thing seems closer than I thought.
Presumably, this would work for non-Minecraft applications, but Minecraft has a really standardized interface layer.
So this tech will increase interest in existing 3d tools in the short term
Edit, changed the links to the direct ones!
These videos honestly give me less confidence in the approach, simply because I don't know that the model will be able to distinguish these world model parameters as "unchanging".
People didn't think this was possible even a year ago.
"Gives me less confidence"
Cmon man...
example video links from TFA:
Every cv preprocessing pipeline is in shambles now.
Still, god damn.
Everything also has a sort of mushy feel about it, like nothing is anchored down and everything is swimming. Impressive all the same, but maybe these methods need to be paired with some old-fashioned 3D rendering to serve as a physically based guideline.
Reasoning, logic, formal systems, and physics exist in a seemingly completely different, mathematical space than pure video.
This is just a contrived, interesting viewpoint of the technology, right?
That's not true, AI systems in general have pretty strong mathematical proofs going back decades on what they can theoretically do, the problem is compute and general feasibility. AIXItl in theory would be able to learn reasoning, logic, formal systems, physics, human emotions, and a great deal of everything else just from watching videos. They would have to be videos of varied and useful things, but even if they were not, you'd at least get basic reasoning, logic, and physics.
> Other interactions, like eating food, do not always yield correct changes in object state
Can this be because we just don't shoot a lot of people eating? I think it is general advice to not show people eating on camera for various reasons. I wonder if we know if that kind of topic bias exists in the dataset.
Is this generating videos as streaming content, e.g. like an mp4 video? As far as I can see, it is doing that. Is it possible for AI to actually produce the 3D models?
What kind of compute resources are required to produce the 3D models?
https://twitter.com/BenMildenhall/status/1758224827788468722
https://twitter.com/ScottieFoxTTV/status/1758272455603327455
The key is that the video has spatial consistency. Once you've got that, then other existing tech can take the output and infer actual spatial forms.
1. High quality video or image from text
2. Taking in any content as input and generating forwards/backwards in time
3. Style transformation
4. Digital world simulation!
so they're gonna include the never-before-observed-but-predicted Unruh effect, as well? and other quantum theory? cool..
> For example, it does not accurately model the physics of many basic interactions, like glass shattering.
... oh
Isn't all of the training predicated on visible, gathered data, rather than theory? If so, I don't think it's right to call these things simulators of the physical world if they don't include physical theory. DFT at least has some roots in theory.
The simple counter argument is that they're claiming that there is potential for general purpose physical simulation. So please don't play with words. I thought we were supposed to be talking about something concrete instead of whatever you wanted to twist things into to suit your purpose
either that, or they're not qualified to distinguish between the two options you presented, which is obviously not true
right?
the article's bazillion references have nothing to do with physics
nor is there any substance behind the claim of physical simulation
what is actually being simulated is synthesis of any video
it's not that complicated and you and they must know it too
it's double speak easily seen from miles away
scammers have to downvote
Some of the reasons the less optimistic scenarios seem likely are that the kinds of extrapolation errors this model makes are similar in character to those of LLMs: extrapolation follows a gradient of smooth apparent transitions rather than some underlying logic about the objects portrayed, and it sometimes seems to just sort of ignore situations that are far enough outside of what it's seen rather than reconcile them. For example, the tidal wave/historical hall example is a scenario unlikely to have been in the training data. Sure, there's the funny bit at the end where the surfer appears to levitate in the air, but there's a much larger issue with how these two contrasting scenes interact, or rather fail to. What we see looks a lot more like a scene of surfing superimposed via Photoshop or something on a still image of the hall, as there's no evidence of the water interacting with the seats or walls of the hall at all. The model will just roll with whatever you tell it to do as best it can, but it's not doing something like modeling "what would happen if" that implausible scenario played out, and even doing that poorly would be a better sign of it "simulating" the described scenario. Instead, we have impressive results for prompts that likely correspond strongly to scenes the model has seen, and evidence of a lack of composition in cases where a particular composition is unlikely to have been seen and needs some underlying understanding of how it "would" work, a gap that is visible to us.
Also, the concept of learning to simulate the world seems more important than just the media and content implications.
Also - ironic choice of username considering this comment!