\n","updatedAt":"2026-02-11T04:58:53.826Z","author":{"_id":"633122d3f242a8532b7a928d","avatarUrl":"/avatars/2158ffff0882a8fb4588e273fd60dea7.svg","fullname":"Chi","name":"ChilleD","type":"user","isPro":true,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":3,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7646274566650391},"editors":["ChilleD"],"editorAvatarUrls":["/avatars/2158ffff0882a8fb4588e273fd60dea7.svg"],"reactions":[],"isReport":false}},{"id":"698d3006c8c22d2561a96d51","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":317,"isUserFollowing":false},"createdAt":"2026-02-12T01:42:30.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [Mock Worlds, Real Skills: Building Small Agentic Language Models with Synthetic Tasks, Simulated Environments, and Rubric-Based Rewards](https://huggingface.co/papers/2601.22511) (2026)\n* [From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents](https://huggingface.co/papers/2601.22607) (2026)\n* [ScaleEnv: Scaling Environment Synthesis from Scratch for Generalist Interactive Tool-Use Agent Training](https://huggingface.co/papers/2602.06820) (2026)\n* [AutoForge: Automated Environment Synthesis for Agentic Reinforcement Learning](https://huggingface.co/papers/2512.22857) (2025)\n* [EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis](https://huggingface.co/papers/2601.05808) (2026)\n* [ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas](https://huggingface.co/papers/2601.21558) (2026)\n* [Close the Loop: Synthesizing Infinite Tool-Use Data via Multi-Agent Role-Playing](https://huggingface.co/papers/2512.23611) (2025)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

\n

The following papers were recommended by the Semantic Scholar API

\n\n

Please give a thumbs up to this comment if you found it helpful!

\n

If you want recommendations for any Paper on Hugging Face checkout this Space

\n

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend

\n","updatedAt":"2026-02-12T01:42:30.654Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":317,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.736871600151062},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2602.10090","authors":[{"_id":"698bf6fe6052d3bed9630ac0","name":"Zhaoyang Wang","hidden":false},{"_id":"698bf6fe6052d3bed9630ac1","name":"Canwen Xu","hidden":false},{"_id":"698bf6fe6052d3bed9630ac2","name":"Boyi Liu","hidden":false},{"_id":"698bf6fe6052d3bed9630ac3","name":"Yite Wang","hidden":false},{"_id":"698bf6fe6052d3bed9630ac4","name":"Siwei Han","hidden":false},{"_id":"698bf6fe6052d3bed9630ac5","name":"Zhewei Yao","hidden":false},{"_id":"698bf6fe6052d3bed9630ac6","name":"Huaxiu Yao","hidden":false},{"_id":"698bf6fe6052d3bed9630ac7","name":"Yuxiong He","hidden":false}],"mediaUrls":["https://cdn-uploads.huggingface.co/production/uploads/633122d3f242a8532b7a928d/at1Sens0OXJ4Yt8ne9kAE.png","https://cdn-uploads.huggingface.co/production/uploads/633122d3f242a8532b7a928d/Hs08PrK-yZHZ5FBmPLoMV.png","https://cdn-uploads.huggingface.co/production/uploads/633122d3f242a8532b7a928d/IIL00IFjA5UOIILbLKOY9.png","https://cdn-uploads.huggingface.co/production/uploads/633122d3f242a8532b7a928d/o5mHd9GtSOBL8Ni_s_J8B.png","https://cdn-uploads.huggingface.co/production/uploads/633122d3f242a8532b7a928d/jsnsJiwl4N10Px2LgZ9Ve.png"],"publishedAt":"2026-02-10T18:55:41.000Z","submittedOnDailyAt":"2026-02-11T02:28:53.819Z","title":"Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning","submittedOnDailyBy":{"_id":"633122d3f242a8532b7a928d","avatarUrl":"/avatars/2158ffff0882a8fb4588e273fd60dea7.svg","isPro":true,"fullname":"Chi","user":"ChilleD","type":"user"},"summary":"Recent advances in large language model (LLM) have empowered autonomous agents to perform complex tasks that require multi-turn interactions with tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable environments. In this paper, we propose Agent World Model (AWM), a fully synthetic environment generation pipeline. Using this pipeline, we scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets (35 tools per environment on average) and obtain high-quality observations. Notably, these environments are code-driven and backed by databases, providing more reliable and consistent state transitions than environments simulated by LLMs. Moreover, they enable more efficient agent interaction compared with collecting trajectories from realistic environments. To demonstrate the effectiveness of this resource, we perform large-scale reinforcement learning for multi-turn tool-use agents. Thanks to the fully executable environments and accessible database states, we can also design reliable reward functions. Experiments on three benchmarks show that training exclusively in synthetic environments, rather than benchmark-specific ones, yields strong out-of-distribution generalization. 
The code is available at https://github.com/Snowflake-Labs/agent-world-model.","upvotes":49,"discussionId":"698bf6ff6052d3bed9630ac8","projectPage":"https://github.com/Snowflake-Labs/agent-world-model","githubRepo":"https://github.com/Snowflake-Labs/agent-world-model","githubRepoAddedBy":"user","ai_summary":"Large language model agents trained in synthetic environments with code-driven simulations and database-backed state transitions demonstrate superior out-of-distribution generalization compared to traditional benchmark-specific approaches.","ai_keywords":["large language model","autonomous agents","multi-turn interactions","tool-use agents","reinforcement learning","synthetic environment generation","code-driven environments","database-backed state transitions","out-of-distribution generalization"],"githubStars":226,"organization":{"_id":"62cece4aa3a23014aca72499","name":"Snowflake","fullname":"Snowflake","avatar":"https://cdn-uploads.huggingface.co/production/uploads/64dc52cf858f8a41c12fc819/O9-MWzRjWzbNP_DQlMb-7.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"5e3c29d8f55e2b62848a5224","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1618318588221-5e3c29d8f55e2b62848a5224.png","isPro":false,"fullname":"Canwen Xu","user":"canwenxu","type":"user"},{"_id":"682233e9ddda5d39df0ce7de","avatarUrl":"/avatars/800519cb15ba990ded23539b3a668f29.svg","isPro":false,"fullname":"Yite Wang","user":"yitewangsnowflake","type":"user"},{"_id":"633122d3f242a8532b7a928d","avatarUrl":"/avatars/2158ffff0882a8fb4588e273fd60dea7.svg","isPro":true,"fullname":"Chi","user":"ChilleD","type":"user"},{"_id":"674d74c28c28f1a8d0a3e2a4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/ZSBPdNZ7n8aXLUK3PV8qG.png","isPro":false,"fullname":"MA Junxiao","user":"MaJunxiao","type":"user"},{"_id":"65dafc22ad7ccf910d7144da","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65dafc22ad7ccf910d7144da/bsGJXsVjwJTVoqSO0b1O3.jpeg","isPro":false,"fullname":"Yuetai Li","user":"TaiGary","type":"user"},{"_id":"6757df5e861539cddf7add20","avatarUrl":"/avatars/2ba7a85d475f86f417b5bdfc2b6641fb.svg","isPro":false,"fullname":"autumn","user":"unfair221","type":"user"},{"_id":"657e466649ec77d48e891110","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/657e466649ec77d48e891110/So7g62eizTT9U8Xf2JrqZ.jpeg","isPro":false,"fullname":"Shangzhe Li","user":"DVA13304","type":"user"},{"_id":"660fe0fea61244f3df52bf85","avatarUrl":"/avatars/04d3c86db7c40c35ccee7f61d3f0bfe2.svg","isPro":false,"fullname":"yuxiao yang","user":"yuxiaoyang","type":"user"},{"_id":"6730dc8df84c8aac97451e57","avatarUrl":"/avatars/4f2cf5363b17744daca41d2a18ddfeb8.svg","isPro":false,"fullname":"Yinjie Wang","user":"yinjiewang","type":"user"},{"_id":"662a7759efa616e734ab493d","avatarUrl":"/avatars/8b79c6ec01f13d6b82414a6ad1b2d588.svg","isPro":false,"fullname":"Ke Yang","user":"EmpathYang","type":"user"},{"_id":"65a558d48ac0602820241c3e","avatarUrl":"/avatars/37bd29981187c251499a77c2cb418e5b.svg","isPro":false,"fullname":"Zipeng 
Ling","user":"ZpLing","type":"user"},{"_id":"664da3097137fdfe69ae81d1","avatarUrl":"/avatars/3169d8e34ec6ed10d0663b8085dc9cd9.svg","isPro":false,"fullname":"rc","user":"ruichenzhang","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0,"organization":{"_id":"62cece4aa3a23014aca72499","name":"Snowflake","fullname":"Snowflake","avatar":"https://cdn-uploads.huggingface.co/production/uploads/64dc52cf858f8a41c12fc819/O9-MWzRjWzbNP_DQlMb-7.png"}}">
Papers
arxiv:2602.10090

Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning

Published on Feb 10 · Submitted by Chi on Feb 11

Authors: Zhaoyang Wang, Canwen Xu, Boyi Liu, Yite Wang, Siwei Han, Zhewei Yao, Huaxiu Yao, Yuxiong He
Abstract

AI-generated summary: Large language model agents trained in synthetic environments with code-driven simulations and database-backed state transitions demonstrate superior out-of-distribution generalization compared to traditional benchmark-specific approaches.

Recent advances in large language models (LLMs) have empowered autonomous agents to perform complex tasks that require multi-turn interactions with tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable environments. In this paper, we propose Agent World Model (AWM), a fully synthetic environment generation pipeline. Using this pipeline, we scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets (35 tools per environment on average) and obtain high-quality observations. Notably, these environments are code-driven and backed by databases, providing more reliable and consistent state transitions than environments simulated by LLMs. Moreover, they enable more efficient agent interaction than collecting trajectories from realistic environments. To demonstrate the effectiveness of this resource, we perform large-scale reinforcement learning for multi-turn tool-use agents. Thanks to the fully executable environments and accessible database states, we can also design reliable reward functions. Experiments on three benchmarks show that training exclusively in synthetic environments, rather than benchmark-specific ones, yields strong out-of-distribution generalization. The code is available at https://github.com/Snowflake-Labs/agent-world-model.
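
To make "code-driven and backed by databases" concrete, here is a minimal, purely illustrative sketch of such an environment; the scenario, tool names, and checker are assumptions for exposition, not code from the paper. Every tool call executes real code against a database, so state transitions are deterministic and the reward can be read directly off the final database state.

```python
import sqlite3


class HotelBookingEnv:
    """Toy synthetic environment: tools mutate a SQLite database, so every
    state transition is deterministic and the final state can be verified."""

    def __init__(self) -> None:
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE rooms (room INTEGER, available INTEGER)")
        self.db.execute("CREATE TABLE bookings (guest TEXT, room INTEGER, nights INTEGER)")
        self.db.executemany("INSERT INTO rooms VALUES (?, ?)", [(101, 1), (102, 1)])
        self.db.commit()

    # --- tools the agent can call ---------------------------------------
    def list_available_rooms(self) -> list[int]:
        rows = self.db.execute("SELECT room FROM rooms WHERE available = 1").fetchall()
        return [r[0] for r in rows]

    def book_room(self, guest: str, room: int, nights: int) -> str:
        free = self.db.execute("SELECT available FROM rooms WHERE room = ?", (room,)).fetchone()
        if not free or free[0] == 0:
            return f"Room {room} is not available."
        self.db.execute("UPDATE rooms SET available = 0 WHERE room = ?", (room,))
        self.db.execute("INSERT INTO bookings VALUES (?, ?, ?)", (guest, room, nights))
        self.db.commit()
        return f"Booked room {room} for {guest} ({nights} nights)."

    # --- task-level outcome check, usable as a reliable reward ----------
    def verify(self, guest: str, nights: int) -> float:
        count = self.db.execute(
            "SELECT COUNT(*) FROM bookings WHERE guest = ? AND nights = ?",
            (guest, nights),
        ).fetchone()[0]
        return 1.0 if count == 1 else 0.0


env = HotelBookingEnv()
env.book_room("Alice", env.list_available_rooms()[0], nights=2)
print(env.verify("Alice", nights=2))  # 1.0 -> the task's goal state was reached
```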

Community

Paper submitter

Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning

🚀 Introducing Agent World Model (AWM): we synthesized 1,000 code-driven environments with 35K tools and 10K tasks for large-scale agentic reinforcement learning!

No real APIs. No human design. Just 100 seed names → fully functional, database-backed agent environments exposed via an MCP interface.
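
As an illustration of "database-backed agent environments exposed via an MCP interface", the sketch below assumes the official MCP Python SDK (`pip install mcp`) and its FastMCP helper; the environment name, tools, and schema are hypothetical, not the ones AWM actually generates.

```python
import sqlite3

from mcp.server.fastmcp import FastMCP  # assumes the official MCP Python SDK

# Database holding this environment's state; check_same_thread=False because the
# server may invoke tool functions from a different thread than the one that opened it.
db = sqlite3.connect("library_env.db", check_same_thread=False)
db.execute("CREATE TABLE IF NOT EXISTS loans (member TEXT, book TEXT)")
db.commit()

mcp = FastMCP("library-environment")  # one MCP server per synthetic environment


@mcp.tool()
def borrow_book(member: str, book: str) -> str:
    """Record a loan in the environment's database."""
    db.execute("INSERT INTO loans VALUES (?, ?)", (member, book))
    db.commit()
    return f"{member} borrowed {book}"


@mcp.tool()
def list_loans(member: str) -> list[str]:
    """Return all books currently borrowed by a member."""
    rows = db.execute("SELECT book FROM loans WHERE member = ?", (member,)).fetchall()
    return [r[0] for r in rows]


if __name__ == "__main__":
    mcp.run()  # serve the tools over stdio so an agent can call them via MCP
```

In this style, each generated environment would be a standalone tool server over its own database, which is what allows many instances to run in parallel during training.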

Agents trained purely on synthetic environments generalize to out-of-distribution benchmarks. Code, environments, and models are all open-sourced. 🔥

We train Qwen3 (4B/8B/14B) with online RL using the GRPO algorithm at scale:

⚡ 1,024 parallel environment instances per training step
🎯 Hybrid reward: step-level format checks + task-level outcome verification (see the sketch after this list)
🧠 History-aware training: sliding-window truncation aligned between training and inference
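
A hedged sketch of the training-side pieces named above: the hybrid reward, a group-relative advantage in the spirit of GRPO, and the sliding-window history truncation. The tag names, weights, and helper signatures are illustrative assumptions, not AWM's exact implementation.

```python
import json
import re
import statistics


def step_format_reward(assistant_turn: str) -> float:
    """Step-level check: small bonus if the turn contains a well-formed tool call.
    The <tool_call> tag and JSON schema are assumptions for illustration."""
    match = re.search(r"<tool_call>(.*?)</tool_call>", assistant_turn, re.DOTALL)
    if match is None:
        return 0.0
    try:
        call = json.loads(match.group(1))
    except json.JSONDecodeError:
        return 0.0
    return 0.1 if isinstance(call, dict) and {"name", "arguments"} <= call.keys() else 0.0


def task_outcome_reward(env, task: dict) -> float:
    """Task-level check: read the final database state through the environment's
    verifier (e.g. the `verify` method of the toy environment sketched earlier)."""
    return env.verify(**task["expected"])


def hybrid_reward(env, task: dict, assistant_turns: list[str]) -> float:
    """Average per-step format reward plus the binary task outcome."""
    step = sum(step_format_reward(t) for t in assistant_turns) / max(len(assistant_turns), 1)
    return step + task_outcome_reward(env, task)


def grpo_advantages(group_rewards: list[float]) -> list[float]:
    """Group-relative advantages: normalize each rollout's reward by the mean and
    standard deviation of its sampled group, as in GRPO."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0
    return [(r - mean) / std for r in group_rewards]


def truncate_history(messages: list[dict], max_turns: int = 8) -> list[dict]:
    """History-aware sliding window: keep the system prompt plus the last
    `max_turns` messages, applied identically at training and inference time
    so the contexts seen by the policy match in both phases."""
    system, rest = messages[:1], messages[1:]
    return system + rest[-max_turns:]
```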

Key insight: code-driven environments give more stable learning signals than LLM-simulated ones, and they are orders of magnitude faster.

Results on three out-of-distribution benchmarks (AWM does NOT train on environments from any of these benchmarks):

📊 BFCLv3: 8B jumps 53.83 → 65.94 (+12.11)
📊 τ²-bench: competitive, with 14B reaching 39.03 Pass@1
📊 MCP-Universe: best overall, 8B: 6.70 → 11.17

🏆 AWM is the ONLY method that improves over the base model on ALL three benchmarks.

📄 Paper: https://arxiv.org/abs/2602.10090
💻 Code: https://github.com/Snowflake-Labs/agent-world-model
🤗 Hugging Face: https://huggingface.co/datasets/Snowflake/AgentWorldModel-1K


Models citing this paper 3

Datasets citing this paper 1

Spaces citing this paper 0


Collections including this paper 4