Qwen-AgentWorld: Language World Models for General Agents

(arxiv.org)

175 points · ilreb · 4d ago · 47 comments

38 comments · 16 top-level

adrian_b · 3d ago · 5 in thread

The smaller of the two models is open weights and available on Huggingface:

https://huggingface.co/Qwen/Qwen-AgentWorld-35B-A3B

npodbielski · 3d ago

I tried to run it, but it seems that either the file is broken or it does not work on dockerized llama.cpp:

0.01.865.326 E llama_model_load: error loading model: missing tensor 'blk.40.attn_norm.weight'

khimaros · 3d ago

That particular quant is just corrupted. These work but seem to loop in reasoning a lot: https://huggingface.co/groxaxo/Qwen-AgentWorld-35B-A3B-GGUF

npodbielski · 2d ago

The unsloth version works.

walrus01 · 3d ago

Give it a day or two and the 'unsloth' people will probably publish a Q6 and Q8 (maybe Q8XL?) quantization in GGUF format for llama-server and other users.

khimaros · 3d ago

they have arrived: https://huggingface.co/unsloth/Qwen-AgentWorld-35B-A3B-GGUF

psc007 · 3d ago · 4 in thread

ELI5? How does this compare to a regular LLM assistant model like the base Qwen?

gavmor · 3d ago

A regular LLM acts as a "policy," mapping a current state to a specific action (states → actions). Their new LLM acts as a "world model," mapping a current state and a chosen action to a predicted future state ((states, actions) → subsequent states). Instead of deciding "what to do," its explicit objective is to predict the exact environment observation that will result from the interaction history and the agent's current action.

I assumed at first that it was trained on synthetic data, but they actually went and deployed real physical hosts and virtual machines (e.g. Ubuntu, macOS, and Android) and browsers. They ran agentic systems on these continuously and recorded the actual, real-world interactions.

So it's an LLM that infers the next state, or outcome, as structured data, e.g. literal HTML code, UI view hierarchies, or accessibility trees.
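A minimal Python sketch of the two signatures may make the distinction concrete. The names `Policy`, `WorldModel`, `toy_policy`, and `toy_world_model` are illustrative assumptions, not anything from the paper or the released code, and the toy state is a one-line counter rather than real HTML.

```python
from typing import Callable

State = str    # e.g. an HTML document or accessibility tree, serialized as text
Action = str   # e.g. "click #submit" or a shell command

# A policy maps a state to an action: state -> action.
Policy = Callable[[State], Action]

# A world model maps a state and an action to a predicted next state:
# (state, action) -> next state.
WorldModel = Callable[[State, Action], State]


def toy_policy(state: State) -> Action:
    """Hypothetical policy: always asks to increment the counter."""
    return "increment"


def toy_world_model(state: State, action: Action) -> State:
    """Hypothetical world model for a page that shows a single counter."""
    count = int(state.removeprefix("count="))
    if action == "increment":
        count += 1
    return f"count={count}"


state = "count=0"
action = toy_policy(state)                   # the policy decides what to do
predicted = toy_world_model(state, action)   # the world model predicts the result
print(action, "->", predicted)
```

In the real system both roles would be played by learned models; the hand-written functions here only show which inputs and outputs each role has.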

dmos62 · 3d ago

So, if I'm reading this correctly, whereas a regular LLM would, given a prompt to edit a file, infer a sed call, this "world" model infers the resulting contents of the file.
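As a rough illustration of that contrast, the sketch below pairs an action (a sed command, as a policy-style model might emit) with the predicted file contents (what a world model would be asked to produce). The file contents, the command, and `simulate_sed_substitute` are all made up for the example; a real world model would be a learned predictor, not a regex.

```python
import re

original = "port = 8080\nhost = localhost\n"

# What a policy-style LLM would emit: the action itself.
action = "sed -i 's/8080/9090/' config.ini"


def simulate_sed_substitute(contents: str, pattern: str, replacement: str) -> str:
    """Toy stand-in for the world model: predicts the file contents after
    a sed 's/pattern/replacement/' (first match on each line)."""
    return "\n".join(
        re.sub(pattern, replacement, line, count=1)
        for line in contents.split("\n")
    )


# What a world-model-style LLM would emit: the resulting state.
predicted_contents = simulate_sed_substitute(original, r"8080", "9090")
print(predicted_contents)
```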

kakugawa · 3d ago

Here's the demo: https://docs.qwenlm.ai/resources/mlu56_demo.html

Here's the description of the world model prompt for the web domain: "A precise GUI state simulator — given the current screen (as HTML) and a user action, predicts the exact next screen as a complete, self-contained HTML document." (You can click the world model prompt box to expand it and see the full prompt.)

So the world model generates the current state (an HTML document), an agent tells it what action it wants to perform, and the world model generates the next state (another HTML document).

The other domains are similar, but with domain-specific nuances.
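The loop described above can be sketched in a few lines of Python. This is a sketch under stated assumptions: `rollout`, `stub_world_model`, and `stub_agent` are hypothetical names, and the stubs are trivial string functions standing in for the two models.

```python
from typing import Callable, List, Tuple

WorldModel = Callable[[str, str], str]   # (current HTML, action) -> next HTML
Agent = Callable[[str], str]             # current HTML -> action


def rollout(world_model: WorldModel, agent: Agent,
            initial_html: str, steps: int) -> List[Tuple[str, str]]:
    """Alternate agent and world model, recording (action, next_state) pairs."""
    history = []
    state = initial_html
    for _ in range(steps):
        action = agent(state)                # the agent picks an action
        state = world_model(state, action)   # the world model predicts the next screen
        history.append((action, state))
    return history


def stub_world_model(html: str, action: str) -> str:
    """Stand-in predictor: logs the action as a new list item."""
    return html.replace("</ul>", f"<li>{action}</li></ul>")


def stub_agent(html: str) -> str:
    """Stand-in agent: clicks the next item that is not yet in the list."""
    return f"click item {html.count('<li>') + 1}"


trajectory = rollout(stub_world_model, stub_agent, "<ul></ul>", steps=2)
for action, state in trajectory:
    print(action, "=>", state)
```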

Freedumbs · 3d ago

Same thing, but Qwen has decided to rebrand certain LLMs that were trained slightly differently as "world models", despite the fact that "world model" typically means !LLM.

Tepix · 3d ago · 4 in thread

The labels of the very first chart (figure 1, bottom left) are obviously wrong, which casts doubt on the entire paper.

dudisubekti · 3d ago

This label?

> Figure 1: Overview of Qwen-AgentWorld. Top: Qwen-AgentWorld is a unified native language world model across seven domains. Bottom: We explore two complementary strategies for applying world modeling to enhance language agents (mainly using the 35B-A3B model as agent): Decouple and Unify, where the world model serves as the environment simulator and agent foundation model, respectively.

Where is the mistake?

Tepix · 3d ago

The deltas are wrong.

The bars above the label "Infinite Real-World Envs", for example, show growth from approximately 42 to 55, but the red label says "+7.1". It's wrong for all of them.

dudisubekti · 3d ago

Ah, I see. Yeah, the graphics are probably AI-generated, and AIs do struggle with unit consistency in charts.

(For another example, see the charts in the August 2025 GPT-5 presentation.)

yorwba · 3d ago

According to Table 6, it's supposed to be 47.9 to 55.
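A quick arithmetic check, using only the numbers quoted in this thread (the Table 6 values as cited here, with 55 read as 55.0, and the approximate bar height of 42 read off the chart), suggests the "+7.1" label matches the table and it is the drawn bar that is off:

```python
table_before, table_after = 47.9, 55.0   # values cited from Table 6 in this thread
chart_before = 42.0                       # approximate bar height read off the chart

delta_from_table = round(table_after - table_before, 1)
delta_from_chart = round(table_after - chart_before, 1)

print("delta implied by the table:", delta_from_table)
print("delta implied by the drawn bars:", delta_from_chart)
```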

verdverm · 3d ago · 3 in thread

35B model from the qwen-3.5 line

https://github.com/QwenLM/Qwen-AgentWorld

https://huggingface.co/Qwen/Qwen-AgentWorld-35B-A3B

khimaros · 3d ago

unsloth, activate!

verdverm · 3d ago

I'm using the official 8-bit quants from Qwen; they retain more capability.

jedisct1 · 3d ago

For Mac users: https://huggingface.co/collections/jedisct1/qwen-agentworld

androiddrew · 3d ago · 2 in thread

I understand what the model is doing. I am struggling to understand where this is going to fit in a workflow. I understand a big gap is that any LLM based ai agent isn't aware of the consequences of its actions because it barely understands the future state its actions will have, hence this model that can.

So, is this like a bolt-on, where you have an agent powered by an LLM, then the world model reviews the action it wants to take, and the agent confirms this is the intention? Is this meant to augment an existing agent with additional capabilities?

roenxi · 3d ago

> I understand a big gap is that any LLM based ai agent isn't aware of the consequences of its actions because it barely understands the future state its actions will have, hence this model that can.

These are probably equivalent. I.e., awareness of consequences is the same as understanding the future state. And the present state, for that matter: I don't see how someone could be said to understand something if they can't predict the consequences of interacting with it. It is forcing the model to develop a more complex internal world model.

anana_ · 3d ago

It looks like the purpose of this model is to (i) generate environment simulation data for doing RL on other models, or (ii) act as a foundation model (they trained it to select actions as well as predict the next state in the same loop?).

Either way, neither is intended for end consumers.

aliljet · 3d ago · 2 in thread

The benchmarks here are confusing at best. Am I reading correctly that this model is essentially as good as or better than all frontier models right now?

anana_ · 3d ago

I believe the benchmark listed is about simulating the environment for the various tasks, rather than doing them. It seems that the point of this model is to generate simulation data with which to improve other models.

blourvim · 3d ago

Benchmarks in general are a little iffy; the whole industry is going off of vibes anyway. Can't decide before trying it out.

dippogriff · 3d ago · 1 in thread

I'm a fan of this direction. For me the most interesting use case for these world models isn't even training, it's verification. If this thing, or some idealized version of it, can actually simulate state transitions reliably, could you use it to verify an agent's execution path against hard constraints and replace/eclipse LLM-as-a-judge?

nostrebored · 3d ago

Well, if you can do this, then you don't delegate execution path derivation to the agent. The benefit is a predictable, coherent world state where you understand the impact of { current state } x { action } without having to enumerate that huge Cartesian product.
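The verification idea in this subthread can be sketched as follows: roll a proposed plan forward through a world model and stop at the first predicted state that breaks a hard constraint. Everything here is hypothetical (`first_violation`, `toy_fs_model`, the file names), and the toy model is a hand-written set update; the approach is only as trustworthy as the real predictor behind it.

```python
from typing import Callable, FrozenSet, List, Optional, Sequence

State = FrozenSet[str]                      # toy state: the set of files that exist
WorldModel = Callable[[State, str], State]  # (state, action) -> predicted next state
Constraint = Callable[[State], bool]        # True if the state is acceptable


def first_violation(world_model: WorldModel, state: State,
                    actions: Sequence[str],
                    constraints: List[Constraint]) -> Optional[int]:
    """Simulate the plan step by step; return the index of the first action
    whose predicted state violates a constraint, or None if the plan passes."""
    for i, action in enumerate(actions):
        state = world_model(state, action)
        if not all(check(state) for check in constraints):
            return i
    return None


def toy_fs_model(files: State, action: str) -> State:
    """Stand-in world model that understands only 'touch NAME' and 'rm NAME'."""
    verb, name = action.split(" ", 1)
    if verb == "touch":
        return files | {name}
    if verb == "rm":
        return files - {name}
    return files


def must_keep_db(files: State) -> bool:
    """Hard constraint: the (made-up) production database must always exist."""
    return "prod.db" in files


plan = ["touch notes.txt", "rm prod.db", "touch prod.db"]
bad_step = first_violation(toy_fs_model, frozenset({"prod.db"}), plan, [must_keep_db])
print("first violating step:", bad_step)
```

Note that the plan ends in an acceptable state but still fails, because the constraint is checked on every intermediate state rather than only on the final one.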

singularity2001 · 3d ago · 1 in thread

I thought that in this day and age "world model" also includes robot arm training data and robot arm benchmarks.

blenklo · 3d ago

Never heard that.

A world model builds itself a model of the world in which it can simulate an outcome.

In the best case it doesn't depend on robotics; otherwise it would be quite limited in what you can use it for.

You can imagine what happens when you write your boss a very inappropriate email; you don't need robotic arms for that.

Xx_crazy420_xX · 3d ago

I think open-ended simulation for agents will be a key component for training and planning, similar to how human dreams simulate different scenarios in our heads. The biggest challenge will be simulating more abstract and complex systems.

A few months ago I experimented with an open-ended world simulation for an AI agent, where the simulated world progressively built itself based on each of the agent's actions. The idea was to give the agent unlimited possibilities for tool calling: each tool call would be approved by an adjudicator, and the world state would change accordingly. The key issues with the PoC were:

  - World decoherence (I tried to solve that with a poor graph implementation)
  - World flatness: the high level of abstraction did not account for small events that would compound in the real world
  - Starting with an empty context made it really hard to get the agent to explore the world

Anyway, the project became really funny when you watched the agent struggling in desperation to perform real-world actions that would be impossible in the real world. The main observation was that showing the agent its current action budget modulated its creativity and how desperate its actions were.

blurbleblurble · 3d ago

This might be pretty big. One of my biggest frustrations with smaller models (especially MoE) is their failure to track workflow state at a high level. I'm constantly reminding them what we decided on or asking them to revisit, and reminding them eats context.

Seems like this might make that a lot less painful. And if not right off the bat, then with some minimal tuning or even just good prompting.

pulkas · 3d ago

I think the next movement is heading toward multi-model orchestration.

https://developer.nvidia.com/blog/train-small-orchestration-...

trilogic · 3d ago

https://hugston.com/news/qwen-35b-agentworld-insights

https://hugston.com/models/hugston-qwen-agentworldq4-k-m

tsunamifury · 3d ago

This deeply calls into question the rationale for banning SOTA models like fable.

Simpler models with better simulated context will be able to execute more practically than SOTA models without such training.

To me this says we should open fable up for defensive reasons rather than fear offensive use. SOTA models will be continuously outmatched by lower-grade models with better context techniques like this, plus longer walks and deeper inference.

Now you might say SOTA models could then use that and go even further… but how are you going to keep that cat in the bag anyway?

avaer · 3d ago

Note that this can run locally on a gaming card when quantized. I got it running on a 4090 (24GB) at 150 t/s with a Q4_K_M quant.
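A back-of-the-envelope estimate shows why a 4-bit quant of this model plausibly fits on a 24GB card. The numbers are assumptions: 35e9 is the nominal total parameter count taken from the model name, and 4.8 bits per weight is a rough average for a Q4_K_M-style quant, not a measured figure for this file. The estimate covers weights only and ignores KV cache and activations.

```python
def quantized_weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB (weights only)."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


# Assumed inputs: nominal parameter count and a rough average bit width.
weights_gib = quantized_weight_gib(35e9, 4.8)

# Treating the card's 24GB as 24 GiB for simplicity.
headroom_gib = 24 - weights_gib

print(f"weights ~ {weights_gib:.1f} GiB, headroom ~ {headroom_gib:.1f} GiB")
```

Whatever headroom remains has to hold the KV cache and runtime buffers, so long contexts may still not fit.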

ElenaDaibunny · 3d ago

10M trajectories, probably more of a data scale win than a world model breakthrough tbh

zkmon · 3d ago

What if they did this using GLM 5.2? This looks like a new direction for AI.
