Paper page - VLA-R1: Enhancing Reasoning in Vision-Language-Action Models

Papers
arxiv:2510.01623

VLA-R1: Enhancing Reasoning in Vision-Language-Action Models

Published on Oct 2, 2025
· Submitted by
Zeyu Zhang
on Oct 3, 2025
Authors: Angen Ye, Zeyu Zhang, Boyuan Wang, Xiaofeng Wang, Dapeng Zhang, Zheng Zhu

Abstract

VLA-R1 enhances VLA models with RLVR and GRPO to improve reasoning and execution, achieving better generalization and real-world performance using a new dataset with chain-of-thought supervision.

AI-generated summary

Vision-Language-Action (VLA) models aim to unify perception, language understanding, and action generation, offering strong cross-task and cross-scene generalization with broad impact on embodied AI. However, current VLA models often lack explicit step-by-step reasoning, instead emitting final actions without considering affordance constraints or geometric relations. Their post-training pipelines also rarely reinforce reasoning quality, relying primarily on supervised fine-tuning with weak reward design. To address these challenges, we present VLA-R1, a reasoning-enhanced VLA that integrates Reinforcement Learning from Verifiable Rewards (RLVR) with Group Relative Policy Optimization (GRPO) to systematically optimize both reasoning and execution. Specifically, we design an RLVR-based post-training strategy with verifiable rewards for region alignment, trajectory consistency, and output formatting, thereby strengthening reasoning robustness and execution accuracy. Moreover, we develop VLA-CoT-13K, a high-quality dataset that provides chain-of-thought supervision explicitly aligned with affordance and trajectory annotations. Furthermore, extensive evaluations on in-domain, out-of-domain, simulation, and real-robot platforms demonstrate that VLA-R1 achieves superior generalization and real-world performance compared to prior VLA methods. We plan to release the model, code, and dataset following the publication of this work. Code: https://github.com/GigaAI-research/VLA-R1. Website: https://gigaai-research.github.io/VLA-R1.
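
The abstract describes verifiable rewards for region alignment, trajectory consistency, and output formatting, combined with GRPO's group-relative advantage estimation. The Python sketch below illustrates how such rewards and advantages could be computed for a group of sampled rollouts; the function names, the <think>/<answer> output tags, and the equal reward weighting are assumptions made for illustration, not the released VLA-R1 implementation.

```python
# Illustrative sketch of RLVR-style verifiable rewards and a GRPO-style
# group-relative advantage, following the abstract's description.
# Function names, reward weights, and the <think>/<answer> tag format
# are assumptions, not the paper's released code.
import re
import numpy as np


def region_alignment_reward(pred_box, gt_box):
    """IoU between predicted and ground-truth affordance boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred_box[0], gt_box[0]), max(pred_box[1], gt_box[1])
    ix2, iy2 = min(pred_box[2], gt_box[2]), min(pred_box[3], gt_box[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area(pred_box) + area(gt_box) - inter
    return inter / union if union > 0 else 0.0


def trajectory_consistency_reward(pred_traj, gt_traj, scale=100.0):
    """Map the mean point-wise distance between trajectories to a reward in (0, 1]."""
    pred, gt = np.asarray(pred_traj, float), np.asarray(gt_traj, float)
    n = min(len(pred), len(gt))
    mean_dist = np.linalg.norm(pred[:n] - gt[:n], axis=-1).mean()
    return float(np.exp(-mean_dist / scale))


def format_reward(text):
    """1.0 if the output contains the assumed <think>...</think><answer>...</answer> tags."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.search(pattern, text, flags=re.DOTALL) else 0.0


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize each rollout's reward within its sampling group."""
    r = np.asarray(rewards, float)
    return (r - r.mean()) / (r.std() + eps)


# Toy usage: score a group of sampled rollouts for one prompt.
rollouts = [
    {"text": "<think>grasp handle</think><answer>[10,20,60,80]</answer>",
     "box": (10, 20, 60, 80), "traj": [(0, 0), (5, 5), (10, 10)]},
    {"text": "pick it up",
     "box": (0, 0, 30, 30), "traj": [(0, 0), (20, 0), (40, 0)]},
]
gt_box, gt_traj = (12, 22, 58, 78), [(0, 0), (5, 5), (11, 11)]
rewards = [
    region_alignment_reward(r["box"], gt_box)
    + trajectory_consistency_reward(r["traj"], gt_traj)
    + format_reward(r["text"])
    for r in rollouts
]
print(grpo_advantages(rewards))  # the well-formatted, aligned rollout gets the higher advantage
```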

Community

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2510.01623 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2510.01623 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2510.01623 in a Space README.md to link it from this page.

Collections including this paper 7