This is a preview of a very different type of computer use model: we train on the internet. Specifically, we have 11 million hours of computer video stored on our storage cluster (previously shared https://news.ycombinator.com/item?id=45438496 !) and the model can work at 30 FPS. Since we match the fundamental form factor of computer use, we can get our model to do CAD, browse websites, and even drive a car using arrow keys. I’m super excited to see what our model can do as we scale more; it's a fun frontier to work on (not language models :) ).
The team and I will be online responding to the comments, so drop any questions.
Any benchmark comparisons to Fara-7B or Sonnet 4.6, Qwen 3.5 etc.?
In particular, the Forward rollout module is very important. It aligns your (effectively) world model with what it expects from the world, and keeping those in sync, I think, gives this the power it needs to generate the state-action pairs to continuously train in a semi-supervised way.
Must have been really hard. What was the breakthrough?
I wanted to comment, though, that your title is not doing you any favors, and I suspect that is why this is not getting more traction (which it deserves). I fully expected some half-baked GitHub repo, but instead found something truly awesome.
To use your own words, Neel, “a very different type of computer use model” would have had me clicking faster. I’m not great at titles, however, and maybe there are better ideas out there.
Anyway, can’t wait to see how this develops! Especially looking forward to the CAD work.
It does make me wonder if you should have the inverse dynamics model split into specifically retrocausal and causal. You kind of do this already with the inverse and forward dynamics model, but the idea of a model that knows only about the future training in a feedback loop with a model that knows only about the past is kind of interesting.
I think you could just do a clever masking regime in your diffusion model to achieve the same effect without a whole architecture change.
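To make the masking idea a bit more concrete, here is a minimal sketch of what a "past-only" versus "future-only" attention mask could look like. This is only an illustration under assumptions: the function names are made up, nothing here reflects the actual architecture in the post, and whether each position may see itself is an arbitrary choice (it can here, so no row is empty).

```python
def causal_mask(n: int) -> list[list[bool]]:
    """mask[i][j] is True when position i may attend to position j (past and self)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def anticausal_mask(n: int) -> list[list[bool]]:
    """mask[i][j] is True when position i may attend to position j (future and self)."""
    return [[j >= i for j in range(n)] for i in range(n)]


def show(mask: list[list[bool]]) -> str:
    """Render a mask as rows of 'x' (visible) and '.' (hidden)."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


print("causal (knows only the past):")
print(show(causal_mask(4)))
print("anti-causal (knows only the future):")
print(show(anticausal_mask(4)))
```

Switching between the two masks during training would let one set of weights play both roles, which is presumably the appeal over maintaining two separate architectures.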
> [previous models] burn a million tokens to understand just one minute of 30 FPS computer data. Our video encoder encodes nearly 2 hours of video in the same number of tokens—that’s 50x more token-efficient than the previous state-of-the-art and 100x more token-efficient than OpenAI’s encoder.
While I was already aware that there are people working on new, more efficient "world models," this is the first one I've seen in action. I'm a bit in shock at how good it is, quite frankly.
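For anyone who wants to sanity-check the quoted figures, here is a small back-of-the-envelope sketch. It only uses numbers from the quote (30 FPS, a million tokens, one minute versus "nearly 2 hours") and treats "nearly 2 hours" as exactly 2 hours, so the result is an upper bound. The function name is made up for illustration. The quoted 50x and 100x figures appear to be measured against other baselines whose token rates are not given in the quote, so they are not reproduced here.

```python
FPS = 30                  # frame rate from the quote
TOKEN_BUDGET = 1_000_000  # "a million tokens", from the quote


def tokens_per_frame(seconds_covered: float, budget: int = TOKEN_BUDGET, fps: int = FPS) -> float:
    """Average tokens per frame when `budget` tokens cover `seconds_covered` of video."""
    return budget / (seconds_covered * fps)


baseline = tokens_per_frame(60)       # one minute of video per million tokens
encoder = tokens_per_frame(2 * 3600)  # ASSUMPTION: "nearly 2 hours" rounded up to 2 hours

print(f"baseline: {baseline:.1f} tokens/frame")  # 555.6
print(f"encoder:  {encoder:.2f} tokens/frame")   # 4.63
print(f"ratio:    {baseline / encoder:.0f}x")    # 120x
```

So at 30 FPS the quoted encoder would be spending fewer than five tokens per frame on average, which is only plausible if tokens are shared heavily across time.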
I've added the OP, as well as a related 2018 paper on Behavioral Cloning from Observation (BCO), to my reading list.[a] So far, I've only skimmed the 2018 paper, but it's already evident that it's well-written. I'm no expert in deep RL, and I can understand it. BTW, "Behavioral Cloning from Observation" is a really good name, with an easy-to-remember acronym.
Thank you for sharing this on HN.
Same here.
The notion of inducing these models to "hypothesize" distributions over possible actions given subsequent observed transitions makes me think of "contrastive divergence," the method Hinton and others came up with for unsupervised training of Restricted Boltzmann Machines (RBMs), in the prehistoric era of deep learning.
Given each training sample, an RBM would 1) execute a forward pass, 2) sample its output units, 3) "hypothesize" its input units, 4) execute another forward pass on the "hypothesized" input units to sample new output units, and 5) compute a type of contrastive error for local backpropagation. RBMs could be stacked, with output units from one becoming input units for the next one. Hinton called the input units "visible," and the output ones "hidden."
It's not the same, obviously, but the idea of modeling machine-generated inputs (or actions) given outputs (or transitions) has always been appealing. It has a long history.
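For readers who never met RBMs, the five steps described above can be sketched as a single-step contrastive divergence (CD-1) update. This is a toy, standard-library-only illustration, not anyone's production code: `TinyRBM`, `cd1_step`, and `train` are hypothetical names, the two training patterns are made up, and step 4 uses hidden probabilities rather than fresh samples, which is a common simplification. Step 5 is written as the usual local CD-1 weight update.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyRBM:
    """Hypothetical toy RBM with binary visible ("input") and hidden ("output") units."""

    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = random.Random(seed)
        self.W = [[self.rng.gauss(0.0, 0.1) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.b_v = [0.0] * n_visible
        self.b_h = [0.0] * n_hidden

    def hidden_probs(self, v):
        return [sigmoid(self.b_h[j] + sum(v[i] * self.W[i][j] for i in range(len(v))))
                for j in range(len(self.b_h))]

    def visible_probs(self, h):
        return [sigmoid(self.b_v[i] + sum(h[j] * self.W[i][j] for j in range(len(h))))
                for i in range(len(self.b_v))]

    def cd1_step(self, v0, lr=0.1):
        h0_p = self.hidden_probs(v0)                                # 1) forward pass
        h0 = [1.0 if self.rng.random() < p else 0.0 for p in h0_p]  # 2) sample hidden units
        v1 = self.visible_probs(h0)                                 # 3) "hypothesize" visible units
        h1_p = self.hidden_probs(v1)                                # 4) second forward pass
        for i in range(len(v0)):                                    # 5) contrastive update
            for j in range(len(h0_p)):
                self.W[i][j] += lr * (v0[i] * h0_p[j] - v1[i] * h1_p[j])
            self.b_v[i] += lr * (v0[i] - v1[i])
        for j in range(len(h0_p)):
            self.b_h[j] += lr * (h0_p[j] - h1_p[j])
        # Squared reconstruction error, used only for monitoring.
        return sum((a - b) ** 2 for a, b in zip(v0, v1))


def train(patterns, n_hidden=2, epochs=500, seed=0):
    """Run CD-1 over all patterns each epoch; return the model and per-epoch errors."""
    rbm = TinyRBM(len(patterns[0]), n_hidden, seed)
    errors = [sum(rbm.cd1_step(v) for v in patterns) for _ in range(epochs)]
    return rbm, errors


rbm, errors = train([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])
print(f"first epoch error: {errors[0]:.3f}, last epoch error: {errors[-1]:.3f}")
```

The update only needs quantities available at the two ends of each weight, which is what makes it "local" and is part of why the analogy to hypothesizing actions from observed transitions feels natural.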
Are the inverse dynamics and forward dynamics models trained separately? It sounds like if the inverse dynamics model is meant to extrapolate more training data, then perhaps all that means is it takes very little data to generalize directly with the forward dynamics model assuming the right architecture.
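As a rough picture of what "extrapolating more training data" with an inverse dynamics model could mean, here is a toy sketch in the spirit of BCO-style pseudo-labeling. Everything in it is an assumption made for illustration: the state is a single integer, the "IDM" is just a lookup from state difference to the most common action, and `fit_idm` and `pseudo_label` are made-up names. It says nothing about how the models in the post are actually trained.

```python
from collections import Counter, defaultdict


def fit_idm(labeled):
    """Fit a toy inverse dynamics model from (state, action, next_state) triples.

    Returns a dict mapping the state difference to the most frequent action.
    """
    votes = defaultdict(Counter)
    for state, action, next_state in labeled:
        votes[next_state - state][action] += 1
    return {delta: counts.most_common(1)[0][0] for delta, counts in votes.items()}


def pseudo_label(idm, frames):
    """Infer actions for consecutive frames of unlabeled "video".

    Transitions the IDM has never seen are skipped rather than guessed.
    """
    labeled = []
    for state, next_state in zip(frames, frames[1:]):
        action = idm.get(next_state - state)
        if action is not None:
            labeled.append((state, action, next_state))
    return labeled


# A small hand-labeled set (hypothetical data).
labeled = [(0, "right", 1), (1, "right", 2), (2, "left", 1), (1, "stay", 1)]
# A longer unlabeled trajectory, standing in for recorded video.
video = [3, 4, 5, 5, 4, 3, 2]

idm = fit_idm(labeled)
pairs = pseudo_label(idm, video)
print(pairs)
# [(3, 'right', 4), (4, 'right', 5), (5, 'stay', 5), (5, 'left', 4), (4, 'left', 3), (3, 'left', 2)]
```

The resulting state-action pairs are what a downstream model would then train on, so in this picture the two models are trained in sequence rather than jointly. Whether that matches the post is exactly the open question.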
It would be pretty interesting to see activation maps for the encoder on video. It would build confidence to see the compression it has derived from so much training.
Benchmarks are really fun—lots of secret ones. Our main thesis is that you should be using the same benchmarks to measure human ability to use a computer, as you would an AI model. Definitely a suite of continuous long term planning tasks (games) and things such as marking emails as spam etc.
definitely! we are looking into more interp + visualizations in general as we scale up.
Do you have other examples of special cases you're looking at? Any 3d ones?
> We believe artificial general intelligence will be created within our lifetimes, and likely within the next decade.
Maybe within our lifetimes (if you are young) but I find it highly unlikely within the next decade.
Also, thanks for choosing a technical blog post for presenting this information.
we all have various backgrounds. me in particular, i did a lot of materials science x AI research and just fundamental architecture research before
Really interesting breakdown, proper nerdsniped into this, thanks for the refreshing AI news outside of language models :)
Otherwise, very cool and exciting!
I think we'll see more of these video encoder models in the coming years, they truly seem like magic.
You write:
>We created a model without this tradeoff by training our video encoder on a masked compression objective
And I understand why this would give you more detail per token, but how are you reducing total number of tokens?
Wonder how much data is generalizable across different UIs? I.e., how good will the model be at using Figma if it’s never seen it before but has seen a lot of Photoshop?
How effective is the model on real-world computer tasks?
Can you prompt it or is it strictly Copilot-style prediction?
What's the plan on that front?
I’d love to see this sort of thing paired with eye tracking and turned into a general purpose precog predictive tool for computer use … but you probably have many better use cases for your world model!
Disgusting website.