Reinforcement World Model Learning for LLM-based Agents
\n","updatedAt":"2026-02-06T21:10:46.175Z","author":{"_id":"65243980050781c16f234f1f","avatarUrl":"/avatars/743a009681d5d554c27e04300db9f267.svg","fullname":"Avi","name":"avahal","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":3,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.754965603351593},"editors":["avahal"],"editorAvatarUrls":["/avatars/743a009681d5d554c27e04300db9f267.svg"],"reactions":[],"isReport":false}},{"id":"6986975ba587f226253ff2e6","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2026-02-07T01:37:31.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [From Word to World: Can Large Language Models be Implicit Text-based World Models?](https://huggingface.co/papers/2512.18832) (2025)\n* [DynaWeb: Model-Based Reinforcement Learning of Web Agents](https://huggingface.co/papers/2601.22149) (2026)\n* [When Actions Teach You to Think: Reasoning-Action Synergy via Reinforcement Learning in Conversational Agents](https://huggingface.co/papers/2512.11277) (2025)\n* [Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models](https://huggingface.co/papers/2601.18734) (2026)\n* [Meta-RL Induces Exploration in Language Agents](https://huggingface.co/papers/2512.16848) (2025)\n* [SpeakRL: Synergizing Reasoning, Speaking, and Acting in Language Models with Reinforcement Learning](https://huggingface.co/papers/2512.13159) (2025)\n* [Reinforcement Learning for Self-Improving Agent with Skill Library](https://huggingface.co/papers/2512.17102) (2025)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
\n
The following papers were recommended by the Semantic Scholar API
Please give a thumbs up to this comment if you found it helpful!
\n
If you want recommendations for any Paper on Hugging Face checkout this Space
\n
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend
\n","updatedAt":"2026-02-07T01:37:31.073Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7497526407241821},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2602.05842","authors":[{"_id":"69855bf84ad556f294b7eb1e","name":"Xiao Yu","hidden":false},{"_id":"69855bf84ad556f294b7eb1f","name":"Baolin Peng","hidden":false},{"_id":"69855bf84ad556f294b7eb20","name":"Ruize Xu","hidden":false},{"_id":"69855bf84ad556f294b7eb21","name":"Yelong Shen","hidden":false},{"_id":"69855bf84ad556f294b7eb22","name":"Pengcheng He","hidden":false},{"_id":"69855bf84ad556f294b7eb23","name":"Suman Nath","hidden":false},{"_id":"69855bf84ad556f294b7eb24","name":"Nikhil Singh","hidden":false},{"_id":"69855bf84ad556f294b7eb25","name":"Jiangfeng Gao","hidden":false},{"_id":"69855bf84ad556f294b7eb26","name":"Zhou Yu","hidden":false}],"publishedAt":"2026-02-05T16:30:08.000Z","submittedOnDailyAt":"2026-02-06T00:43:22.026Z","title":"Reinforcement World Model Learning for LLM-based Agents","submittedOnDailyBy":{"_id":"61942296d5c2ba6daa290357","avatarUrl":"/avatars/594021cc183c4922d48b46f43772a062.svg","isPro":false,"fullname":"Baolin Peng","user":"Baolin","type":"user"},"summary":"Large language models (LLMs) have achieved strong performance in language-centric tasks. However, in agentic settings, LLMs often struggle to anticipate action consequences and adapt to environment dynamics, highlighting the need for world-modeling capabilities in LLM-based agents. We propose Reinforcement World Model Learning (RWML), a self-supervised method that learns action-conditioned world models for LLM-based agents on textual states using sim-to-real gap rewards. Our method aligns simulated next states produced by the model with realized next states observed from the environment, encouraging consistency between internal world simulations and actual environment dynamics in a pre-trained embedding space. Unlike next-state token prediction, which prioritizes token-level fidelity (i.e., reproducing exact wording) over semantic equivalence and can lead to model collapse, our method provides a more robust training signal and is empirically less susceptible to reward hacking than LLM-as-a-judge. We evaluate our method on ALFWorld and τ^2 Bench and observe significant gains over the base model, despite being entirely self-supervised. 
When combined with task-success rewards, our method outperforms direct task-success reward RL by 6.9 and 5.7 points on ALFWorld and τ^2 Bench respectively, while matching the performance of expert-data training.","upvotes":27,"discussionId":"69855bf84ad556f294b7eb27","ai_summary":"Reinforcement World Model Learning enables LLM-based agents to better anticipate action consequences and adapt to environment dynamics through self-supervised training that aligns simulated and real-world state transitions in embedding space.","ai_keywords":["world-modeling","reinforcement learning","action-conditioned world models","sim-to-real gap rewards","next-state token prediction","reward hacking","task-success rewards","embedding space","agent-based systems"],"organization":{"_id":"68151d0f51add3813f3f7d1b","name":"MicrosoftResearch","fullname":"Microsoft Research","avatar":"https://cdn-uploads.huggingface.co/production/uploads/6529a4f2f1205983224fa513/PeuVr7jSuJflmDBBGxoDX.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6234fd736dcfc5fe9f5b8601","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1647639915850-noauth.jpeg","isPro":false,"fullname":"Xiao Yu","user":"jasonyux","type":"user"},{"_id":"61942296d5c2ba6daa290357","avatarUrl":"/avatars/594021cc183c4922d48b46f43772a062.svg","isPro":false,"fullname":"Baolin Peng","user":"Baolin","type":"user"},{"_id":"6039478ab3ecf716b1a5fd4d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6039478ab3ecf716b1a5fd4d/_Thy4E7taiSYBLKxEKJbT.jpeg","isPro":true,"fullname":"taesiri","user":"taesiri","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"622474f38dc6b0b64f5e903d","avatarUrl":"/avatars/d6b60a014277a8ec7d564163c5f644aa.svg","isPro":false,"fullname":"Yuxin Zuo","user":"yuxinzuo","type":"user"},{"_id":"6680b6c42698e06471df1307","avatarUrl":"/avatars/5570ab6041c70f91828df2f9ddfd6303.svg","isPro":false,"fullname":"Li","user":"BX2001","type":"user"},{"_id":"6463554dd2044cd1d7c6e0bf","avatarUrl":"/avatars/d7653623117268c545a7063fec69664b.svg","isPro":false,"fullname":"Bingzheng Wei","user":"Bingzheng","type":"user"},{"_id":"675dd24a2c98629a5e49dfac","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/tI3V8-PZ8d3CC32fzO31e.png","isPro":false,"fullname":"Starstrek","user":"Stars321123","type":"user"},{"_id":"6389d7e7d87ef988fe6acc73","avatarUrl":"/avatars/621afbb669599076f498d6bdd52dd71f.svg","isPro":false,"fullname":"linjianman","user":"linjianman","type":"user"},{"_id":"61e52be53d6dbb1da842316a","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/61e52be53d6dbb1da842316a/gx0WGPcOCClXPymoKglc4.jpeg","isPro":false,"fullname":"Börje Karlsson","user":"tellarin","type":"user"},{"_id":"655601f1ae085c2ba7a22b95","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/4UmxFrc_TEiXcnm3RewZM.jpeg","isPro":false,"fullname":"Xiaoji Zheng","user":"Student-Xiaoji","type":"user"},{"_id":"6570450a78d7aca0c361a177","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6570450a78d7aca0c361a177/MX7jHhTQwLs-BvYIu5rqb.jpeg","isPro":false,"fullname":"Harold 
Chen","user":"Harold328","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0,"organization":{"_id":"68151d0f51add3813f3f7d1b","name":"MicrosoftResearch","fullname":"Microsoft Research","avatar":"https://cdn-uploads.huggingface.co/production/uploads/6529a4f2f1205983224fa513/PeuVr7jSuJflmDBBGxoDX.png"}}">
AI-generated summary

Reinforcement World Model Learning enables LLM-based agents to better anticipate action consequences and adapt to environment dynamics through self-supervised training that aligns simulated and real-world state transitions in embedding space.

Abstract
Large language models (LLMs) have achieved strong performance in language-centric tasks. However, in agentic settings, LLMs often struggle to anticipate action consequences and adapt to environment dynamics, highlighting the need for world-modeling capabilities in LLM-based agents. We propose Reinforcement World Model Learning (RWML), a self-supervised method that learns action-conditioned world models for LLM-based agents on textual states using sim-to-real gap rewards. Our method aligns simulated next states produced by the model with realized next states observed from the environment, encouraging consistency between internal world simulations and actual environment dynamics in a pre-trained embedding space. Unlike next-state token prediction, which prioritizes token-level fidelity (i.e., reproducing exact wording) over semantic equivalence and can lead to model collapse, our method provides a more robust training signal and is empirically less susceptible to reward hacking than LLM-as-a-judge. We evaluate our method on ALFWorld and τ^2 Bench and observe significant gains over the base model, despite being entirely self-supervised. When combined with task-success rewards, our method outperforms direct task-success reward RL by 6.9 and 5.7 points on ALFWorld and τ^2 Bench respectively, while matching the performance of expert-data training.
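To make the training signal concrete, the sketch below shows one way the sim-to-real gap reward described in the abstract could be computed: the world model's simulated next state and the environment's realized next state are embedded with a frozen pre-trained text encoder, and their cosine similarity serves as the reward, which can then be mixed with a sparse task-success reward. This is a minimal sketch under stated assumptions; the encoder choice, function names, and mixing weight are illustrative and not taken from the paper.

```python
# Illustrative sketch of a sim-to-real gap reward computed in a pre-trained
# embedding space. The encoder ("all-MiniLM-L6-v2"), the function names, and
# the `alpha` mixing weight are assumptions, not the paper's implementation.
import numpy as np
from sentence_transformers import SentenceTransformer

# Frozen, pre-trained text encoder used only for scoring (hypothetical choice).
encoder = SentenceTransformer("all-MiniLM-L6-v2")


def sim_to_real_gap_reward(simulated_next_state: str, realized_next_state: str) -> float:
    """Cosine similarity between the world model's predicted next textual state
    and the next state actually observed from the environment."""
    sim_emb, real_emb = encoder.encode([simulated_next_state, realized_next_state])
    cos = np.dot(sim_emb, real_emb) / (
        np.linalg.norm(sim_emb) * np.linalg.norm(real_emb) + 1e-8
    )
    return float(cos)


def combined_reward(task_success: float, world_model_reward: float, alpha: float = 0.5) -> float:
    """One simple way to mix a task-success reward with the self-supervised
    world-model reward; the abstract reports gains from such a combination,
    but the exact weighting scheme here is a guess."""
    return task_success + alpha * world_model_reward


# Example usage on a single environment step:
predicted_obs = "You open the fridge. Inside you see a mug and an apple."
actual_obs = "The fridge is open. You see an apple and a mug inside."
r_wm = sim_to_real_gap_reward(predicted_obs, actual_obs)  # near 1.0 for semantic matches
r_total = combined_reward(task_success=0.0, world_model_reward=r_wm)
print(f"world-model reward: {r_wm:.3f}, combined reward: {r_total:.3f}")
```

Scoring in embedding space rewards semantic agreement rather than exact wording, which is why, per the abstract, it provides a more robust signal than next-state token prediction and is empirically less susceptible to reward hacking than an LLM-as-a-judge.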