Hey LocalLLaMA,

Here’s something new for you: Mobile World Models.

We just released gWorld — open-weight visual world models for mobile GUIs (8B and 32B).

Demo Video Explanation:

Here’s gWorld 32B imagining a multi-step Booking dot com session — zero access to the real app:

  1. It sees the flight search form (Detroit → Chicago)

  2. Clicks “Search” → writes code → renders the full results page with airlines, prices, times

  3. Clicks the destination field → predicts the search UI with history

Every screen = executable HTML/CSS/JS rendered to pixels.

The core idea: Instead of predicting the next screen as pixels (diffusion, autoregressive image gen), gWorld predicts it as executable web code. You render the code, you get the image. This sounds simple but it works remarkably well because VLMs already have strong priors on structured web code from pre-training.
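To make the loop concrete, here's a minimal sketch of the predict-as-code cycle. The model call is stubbed with a canned response and the function names are mine, not the gWorld API; a real setup would prompt the VLM with the current screenshot plus the action and render the returned HTML with a headless browser.

```python
# Illustrative sketch of the world-model step: (screen, action) -> next-screen
# HTML -> pixels. The "model" is a stub; names here are hypothetical.

CANNED = {
    ("flight_search", "click:Search"): (
        "<html><body><ul>"
        "<li>Delta 6:00 AM $129</li>"
        "<li>United 7:15 AM $142</li>"
        "</ul></body></html>"
    ),
}

def predict_next_screen(state_id, action):
    """Stub for the VLM call: returns the predicted next screen as HTML."""
    return CANNED.get((state_id, action), "<html><body>unknown</body></html>")

def render_to_pixels(html):
    """Placeholder for the render step (e.g. a headless-browser screenshot)."""
    return f"[rendered {len(html)}-byte page]"

html = predict_next_screen("flight_search", "click:Search")
print(render_to_pixels(html))
```

The key property is that the predicted artifact is code, so the text on the page (airline names, prices) is exact by construction rather than hallucinated pixel-by-pixel.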

Why code instead of pixels?

- Text-based world models lose visual fidelity (can’t represent layouts, colors, images)
- Pixel-generation models hallucinate text and structural elements
- Code generation gives you the best of both: precise text rendering from linguistic priors + high-fidelity visuals from structured code

Results on MWMBench (6 benchmarks, 4 ID + 2 OOD):

| Model | Size | Avg Accuracy |
|---|---|---|
| Qwen3 VL | 8B | 29.2% |
| Llama 4 Scout | 109B (A17B) | 50.0% |
| Llama 4 Maverick | 402B (A17B) | 55.7% |
| Qwen3 VL | 235B (A22B) | 51.5% |
| GLM-4.6V | 106B | 67.4% |
| gWorld | 8B | 74.9% |
| gWorld | 32B | 79.6% |

The 8B model beats everything up to 50× its size. Render failure rate is <1% (vs 40% for base Qwen3 VL 8B before our training).

Other things worth noting:

- Data scaling follows a power law with R² ≥ 0.94: gains are predictable and nowhere near saturating
- We include a Korean apps benchmark (KApps) as an OOD eval; the models generalize well cross-lingually
- The data pipeline is automated: repurpose existing trajectory data → cross-modal relabeling to code → synthetic reasoning traces
- We also show that better world models → better downstream GUI agent performance
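For anyone curious what the power-law fit means mechanically: accuracy ≈ a·Nᵇ is linear in log-log space, so you fit a line to (log N, log acc) and read off R². The data points below are made up purely for illustration, not gWorld's numbers.

```python
# Sketch of a power-law data-scaling fit via log-log least squares.
# Data points are hypothetical, for illustration only.
import math

n = [1e3, 1e4, 1e5, 1e6]        # training examples (made up)
acc = [0.30, 0.42, 0.58, 0.80]  # accuracy (made up)

x = [math.log(v) for v in n]
y = [math.log(v) for v in acc]
mx, my = sum(x) / len(x), sum(y) / len(y)

# Slope b is the power-law exponent; intercept a recovers log of the prefactor.
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
a = my - b * mx

pred = [a + b * xi for xi in x]
ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
ss_tot = sum((yi - my) ** 2 for yi in y)
r2 = 1 - ss_res / ss_tot
print(f"exponent b = {b:.3f}, R² = {r2:.3f}")
```

An R² this close to 1 in log-log space is what "gains are predictable" cashes out to: you can extrapolate how much accuracy another order of magnitude of data should buy.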

Why this matters beyond benchmarks: The bottleneck for training GUI agents with online RL is device-policy coupling — every rollout needs a real Android emulator. World models could decouple this entirely, enabling massively parallel rollouts on pure compute. gWorld is a step in that direction.
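The decoupling argument can be sketched in a few lines: with an emulator, each rollout is pinned to a device; with a world model, a rollout is just a sequence of model calls and fans out across workers trivially. Everything below is illustrative (the step function is a stub, not the gWorld API).

```python
# Sketch: emulator-free rollouts fan out across workers because each
# step is just a model call. world_model_step is a stub, not gWorld's API.
from concurrent.futures import ThreadPoolExecutor

def world_model_step(screen_html, action):
    # A real step would prompt the VLM with the rendered screen + action
    # and return the predicted next-screen HTML.
    return screen_html + f"<!-- after {action} -->"

def rollout(seed_screen, actions):
    screen = seed_screen
    for a in actions:
        screen = world_model_step(screen, a)
    return screen

episodes = [("<html/>", ["tap:search", "type:Chicago", "tap:go"])] * 8
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda e: rollout(*e), episodes))
print(len(results))  # 8 rollouts, no emulator in the loop
```

In the real setting the parallelism limit becomes GPU throughput on the world model rather than how many Android emulators you can keep alive.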

Links:

🤗 gWorld 8B: https://huggingface.co/trillionlabs/gWorld-8B

🤗 gWorld 32B: https://huggingface.co/trillionlabs/gWorld-32B

💻 Code: https://github.com/trillion-labs/gWorld

📄 Paper: https://huggingface.co/papers/2602.01576

🌐 Project page (and demos): https://trillionlabs-gworld.github.io

Benchmarks (incl. KApps) coming soon.

Happy to answer questions.

Built by Trillion Labs × KAIST AI.
