
Project page: https://embodied-reasoning-agent.github.io

arxiv:2510.12693

ERA: Transforming VLMs into Embodied Agents via Embodied Prior Learning and Online Reinforcement Learning

Published on Oct 14, 2025 · Submitted by Rui Yang on Oct 15, 2025
Authors: Hanyang Chen, Mark Zhao, Rui Yang, Qinwei Ma, Ke Yang, Jiarui Yao, Kangrui Wang, Hao Bai, Zhenhailong Wang, Rui Pan, Mengchao Zhang, Jose Barreiros, Aykut Onol, ChengXiang Zhai, Heng Ji, Manling Li, Huan Zhang, Tong Zhang
Abstract

Recent advances in embodied AI highlight the potential of vision language models (VLMs) as agents capable of perception, reasoning, and interaction in complex environments. However, top-performing systems rely on large-scale models that are costly to deploy, while smaller VLMs lack the necessary knowledge and skills to succeed. To bridge this gap, we present Embodied Reasoning Agent (ERA), a two-stage framework that integrates prior knowledge learning and online reinforcement learning (RL). The first stage, Embodied Prior Learning, distills foundational knowledge from three types of data: (1) Trajectory-Augmented Priors, which enrich existing trajectory data with structured reasoning generated by stronger models; (2) Environment-Anchored Priors, which provide in-environment knowledge and grounding supervision; and (3) External Knowledge Priors, which transfer general knowledge from out-of-environment datasets. In the second stage, we develop an online RL pipeline that builds on these priors to further enhance agent performance. To overcome the inherent challenges in agent RL, including long horizons, sparse rewards, and training instability, we introduce three key designs: self-summarization for context management, dense reward shaping, and turn-level policy optimization. Extensive experiments on both high-level planning (EB-ALFRED) and low-level control (EB-Manipulation) tasks demonstrate that ERA-3B surpasses both prompting-based large models and previous training-based baselines. Specifically, it achieves overall improvements of 8.4% on EB-ALFRED and 19.4% on EB-Manipulation over GPT-4o, and exhibits strong generalization to unseen tasks. Overall, ERA offers a practical path toward scalable embodied intelligence, providing methodological insights for future embodied AI systems.
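The abstract names dense reward shaping and turn-level policy optimization among the three RL designs. As a rough, generic illustration of those two ideas only (a minimal sketch, not the authors' implementation; all function and variable names here are hypothetical), one can densify a sparse task reward with a progress-based bonus and then compute returns credited per turn rather than per token:

```python
# Illustrative sketch: shaped dense rewards plus turn-level discounted
# returns. This is a generic policy-gradient-style credit assignment,
# NOT the ERA paper's actual pipeline.

def shaped_reward(sparse_reward, progress_delta, shaping_coef=0.1):
    """Add a small progress-based shaping term to a sparse task reward."""
    return sparse_reward + shaping_coef * progress_delta

def turn_level_returns(turn_rewards, gamma=0.99):
    """Discounted return credited to each turn (one agent action = one
    turn), computed by a backward pass over the episode."""
    returns = []
    g = 0.0
    for r in reversed(turn_rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# Hypothetical 3-turn episode: only the final turn earns the sparse
# task reward, but every turn gets a shaped progress bonus.
sparse = [0.0, 0.0, 1.0]
progress = [0.2, 0.3, 0.5]  # made-up task-progress deltas
rewards = [shaped_reward(s, p) for s, p in zip(sparse, progress)]
returns = turn_level_returns(rewards, gamma=0.9)
```

With shaping, intermediate turns receive nonzero learning signal even though the raw task reward is sparse, which is one standard way to ease long-horizon credit assignment.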

Community

Paper author · Paper submitter

The paper studies training VLM-based embodied agents with a two-stage approach: Embodied Prior Learning followed by online reinforcement learning. It shows that three types of prior data strengthen agents before RL and introduces strategies for stable, effective online RL in multi-turn VLM agents. Project page: https://embodied-reasoning-agent.github.io · Code: https://github.com/Embodied-Reasoning-Agent/Embodied-Reasoning-Agent
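The summary mentions strategies for stable multi-turn RL; one of them, self-summarization for context management, can be sketched generically. The snippet below is a hypothetical illustration (not the paper's implementation), where `summarize` is a trivial stand-in for what would be a model-generated compression of older turns:

```python
# Hypothetical sketch of self-summarization for context management:
# instead of feeding the full interaction history back to the model,
# the agent keeps a bounded window of recent turns plus a running
# summary of everything older.

def summarize(parts, max_chars=200):
    """Placeholder summarizer that truncates to a character budget.
    A real agent would ask the VLM itself to compress the history."""
    return " ".join(parts)[-max_chars:]

class SummarizingContext:
    def __init__(self, max_turns=3):
        self.max_turns = max_turns
        self.summary = ""
        self.recent = []

    def add_turn(self, observation, action):
        self.recent.append(f"obs: {observation} | act: {action}")
        if len(self.recent) > self.max_turns:
            # Fold the oldest turn into the running summary so the
            # prompt length stays bounded over long horizons.
            self.summary = summarize([self.summary, self.recent.pop(0)])

    def prompt_context(self):
        return {"summary": self.summary, "recent_turns": list(self.recent)}
```

The design keeps per-step prompt length roughly constant regardless of episode length, which is one common way to make long-horizon multi-turn RL tractable.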

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* [BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities](https://huggingface.co/papers/2510.08759) (2025)
* [VLA-R1: Enhancing Reasoning in Vision-Language-Action Models](https://huggingface.co/papers/2510.01623) (2025)
* [Nav-R1: Reasoning and Navigation in Embodied Scenes](https://huggingface.co/papers/2509.10884) (2025)
* [Learning Primitive Embodied World Models: Towards Scalable Robotic Learning](https://huggingface.co/papers/2508.20840) (2025)
* [Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning](https://huggingface.co/papers/2510.11027) (2025)
* [Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation](https://huggingface.co/papers/2508.13998) (2025)
* [PRIMT: Preference-based Reinforcement Learning with Multimodal Feedback and Trajectory Synthesis from Foundation Models](https://huggingface.co/papers/2509.15607) (2025)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper: 0

Datasets citing this paper: 0

Spaces citing this paper: 0

Collections including this paper: 9