VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
\n","updatedAt":"2025-04-14T15:07:56.192Z","author":{"_id":"5f0de36419cb630495b8153c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1658676776546-5f0de36419cb630495b8153c.jpeg","fullname":"Tony Zhao","name":"tianchez","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":19,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7440727949142456},"editors":["tianchez"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1658676776546-5f0de36419cb630495b8153c.jpeg"],"reactions":[{"reaction":"๐ฅ","users":["tianchez","SZhanZ","s0m1"],"count":3}],"isReport":false}},{"id":"67fdb80ffdfb1dcf916231f5","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2025-04-15T01:36:15.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [Visual-RFT: Visual Reinforcement Fine-Tuning](https://huggingface.co/papers/2503.01785) (2025)\n* [Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning](https://huggingface.co/papers/2503.18013) (2025)\n* [Perception-R1: Pioneering Perception Policy with Reinforcement Learning](https://huggingface.co/papers/2504.07954) (2025)\n* [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models](https://huggingface.co/papers/2503.06749) (2025)\n* [UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning](https://huggingface.co/papers/2503.21620) (2025)\n* [Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1](https://huggingface.co/papers/2503.24376) (2025)\n* [R1-Zero's \"Aha Moment\" in Visual Reasoning on a 2B Non-SFT Model](https://huggingface.co/papers/2503.05132) (2025)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
\n
The following papers were recommended by the Semantic Scholar API
Please give a thumbs up to this comment if you found it helpful!
\n
If you want recommendations for any Paper on Hugging Face checkout this Space
\n
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend
\n","updatedAt":"2025-04-15T01:36:15.591Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7092918157577515},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[{"reaction":"๐ฅ","users":["luosaike"],"count":1}],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2504.07615","authors":[{"_id":"67fc52b190075b82590d75db","name":"Haozhan Shen","hidden":false},{"_id":"67fc52b190075b82590d75dc","name":"Peng Liu","hidden":false},{"_id":"67fc52b190075b82590d75dd","name":"Jingcheng Li","hidden":false},{"_id":"67fc52b190075b82590d75de","name":"Chunxin Fang","hidden":false},{"_id":"67fc52b190075b82590d75df","name":"Yibo Ma","hidden":false},{"_id":"67fc52b190075b82590d75e0","name":"Jiajia Liao","hidden":false},{"_id":"67fc52b190075b82590d75e1","name":"Qiaoli Shen","hidden":false},{"_id":"67fc52b190075b82590d75e2","name":"Zilun Zhang","hidden":false},{"_id":"67fc52b190075b82590d75e3","name":"Kangjia Zhao","hidden":false},{"_id":"67fc52b190075b82590d75e4","name":"Qianqian Zhang","hidden":false},{"_id":"67fc52b190075b82590d75e5","name":"Ruochen Xu","hidden":false},{"_id":"67fc52b190075b82590d75e6","name":"Tiancheng Zhao","hidden":false}],"publishedAt":"2025-04-10T10:05:15.000Z","submittedOnDailyAt":"2025-04-14T13:37:56.110Z","title":"VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model","submittedOnDailyBy":{"_id":"5f0de36419cb630495b8153c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1658676776546-5f0de36419cb630495b8153c.jpeg","isPro":false,"fullname":"Tony Zhao","user":"tianchez","type":"user"},"summary":"Recently DeepSeek R1 has shown that reinforcement learning (RL) can\nsubstantially improve the reasoning capabilities of Large Language Models\n(LLMs) through a simple yet effective design. The core of R1 lies in its\nrule-based reward formulation, which leverages tasks with deterministic\nground-truth answers to enable precise and stable reward computation. In the\nvisual domain, we similarly observe that a wide range of visual understanding\ntasks are inherently equipped with well-defined ground-truth annotations. This\nproperty makes them naturally compatible with rule-based reward mechanisms.\nMotivated by this observation, we investigate the extension of R1-style\nreinforcement learning to Vision-Language Models (VLMs), aiming to enhance\ntheir visual reasoning capabilities. To this end, we develop VLM-R1, a\ndedicated framework designed to harness RL for improving VLMs' performance on\ngeneral vision-language tasks. Using this framework, we further explore the\nfeasibility of applying RL to visual domain. Experimental results indicate that\nthe RL-based model not only delivers competitive performance on visual\nunderstanding tasks but also surpasses Supervised Fine-Tuning (SFT) in\ngeneralization ability. 
Furthermore, we conduct comprehensive ablation studies\nthat uncover a series of noteworthy insights, including the presence of reward\nhacking in object detection, the emergence of the \"OD aha moment\", the impact\nof training data quality, and the scaling behavior of RL across different model\nsizes. Through these analyses, we aim to deepen the understanding of how\nreinforcement learning enhances the capabilities of vision-language models, and\nwe hope our findings and open-source contributions will support continued\nprogress in the vision-language RL community. Our code and model are available\nat https://github.com/om-ai-lab/VLM-R1","upvotes":35,"discussionId":"67fc52b390075b82590d7634","githubRepo":"https://github.com/om-ai-lab/VLM-R1","githubRepoAddedBy":"user","ai_summary":"A reinforcement learning framework extends the capabilities of Vision-Language Models (VLMs) in visual reasoning by leveraging rule-based reward formulations, achieving competitive performance and superior generalization compared to supervised fine-tuning.","ai_keywords":["reinforcement learning","Large Language Models (LLMs)","rule-based reward formulation","deterministic ground-truth answers","Vision-Language Models (VLMs)","VLM-R1","visual understanding tasks","reward hacking","OD aha moment","training data quality","scaling behavior"],"githubStars":5845},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"5f0de36419cb630495b8153c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1658676776546-5f0de36419cb630495b8153c.jpeg","isPro":false,"fullname":"Tony Zhao","user":"tianchez","type":"user"},{"_id":"64a234a8d94c3f9ccec8476d","avatarUrl":"/avatars/633adf359b1a65f458c3b5a7e59f11c0.svg","isPro":false,"fullname":"Ruochen Xu","user":"ruochenx","type":"user"},{"_id":"630c8cf754c3dbd4804f1f03","avatarUrl":"/avatars/3ad24dea8c91df1632e34b98ea4b144f.svg","isPro":false,"fullname":"Kyusong Lee","user":"kyusonglee","type":"user"},{"_id":"6039478ab3ecf716b1a5fd4d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6039478ab3ecf716b1a5fd4d/_Thy4E7taiSYBLKxEKJbT.jpeg","isPro":true,"fullname":"taesiri","user":"taesiri","type":"user"},{"_id":"6342796a0875f2c99cfd313b","avatarUrl":"/avatars/98575092404c4197b20c929a6499a015.svg","isPro":false,"fullname":"Yuseung \"Phillip\" Lee","user":"phillipinseoul","type":"user"},{"_id":"664da7874eb4c91c8c32d5cc","avatarUrl":"/avatars/42e2e5850404f1bf0f161e188300b830.svg","isPro":false,"fullname":"yyl","user":"yyl123ddd","type":"user"},{"_id":"641d44b121964f8f6d4b213e","avatarUrl":"/avatars/af38a6977313e9d4dcaa485698cb622b.svg","isPro":false,"fullname":"Ying","user":"Heting","type":"user"},{"_id":"656e8fc82e0a38afd18fd996","avatarUrl":"/avatars/2bac70dff5f974d2bba83acf40141e24.svg","isPro":false,"fullname":"KeleiJiang","user":"jkl375","type":"user"},{"_id":"658a2e94991d8e7fb24f7688","avatarUrl":"/avatars/bd1b2597cfb85a69ea37840f8d44d283.svg","isPro":false,"fullname":"Jiajia Liao","user":"Liaojiajia","type":"user"},{"_id":"6461d22cddb3aaa43c8b20b8","avatarUrl":"/avatars/9692cd0bcde8f012d823c17dab6f23bd.svg","isPro":false,"fullname":"Qianqian","user":"qq-hzlh","type":"user"},{"_id":"6270324ebecab9e2dcf245de","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6270324ebecab9e2dcf245de/cMbtWSasyNlYc9hvsEEzt.jpeg","isPro":false,"fullname":"Kye 
Gomez","user":"kye","type":"user"},{"_id":"6407e5294edf9f5c4fd32228","avatarUrl":"/avatars/8e2d55460e9fe9c426eb552baf4b2cb0.svg","isPro":false,"fullname":"Stoney Kang","user":"sikang99","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
AI-generated summary

A reinforcement learning framework extends the capabilities of Vision-Language Models (VLMs) in visual reasoning by leveraging rule-based reward formulations, achieving competitive performance and superior generalization compared to supervised fine-tuning.

Abstract
Recently, DeepSeek R1 has shown that reinforcement learning (RL) can
substantially improve the reasoning capabilities of Large Language Models
(LLMs) through a simple yet effective design. The core of R1 lies in its
rule-based reward formulation, which leverages tasks with deterministic
ground-truth answers to enable precise and stable reward computation. In the
visual domain, we similarly observe that a wide range of visual understanding
tasks are inherently equipped with well-defined ground-truth annotations. This
property makes them naturally compatible with rule-based reward mechanisms.
Motivated by this observation, we investigate the extension of R1-style
reinforcement learning to Vision-Language Models (VLMs), aiming to enhance
their visual reasoning capabilities. To this end, we develop VLM-R1, a
dedicated framework designed to harness RL for improving VLMs' performance on
general vision-language tasks. Using this framework, we further explore the
feasibility of applying RL to the visual domain. Experimental results indicate that
the RL-based model not only delivers competitive performance on visual
understanding tasks but also surpasses Supervised Fine-Tuning (SFT) in
generalization ability. Furthermore, we conduct comprehensive ablation studies
that uncover a series of noteworthy insights, including the presence of reward
hacking in object detection, the emergence of the "OD aha moment", the impact
of training data quality, and the scaling behavior of RL across different model
sizes. Through these analyses, we aim to deepen the understanding of how
reinforcement learning enhances the capabilities of vision-language models, and
we hope our findings and open-source contributions will support continued
progress in the vision-language RL community. Our code and model are available
at https://github.com/om-ai-lab/VLM-R1
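
The rule-based rewards described above are straightforward to make concrete. The sketch below is illustrative only: the answer format, regex, and function names are assumptions for this example, not the exact reward functions shipped in the VLM-R1 repository. For a grounding-style task with box ground truth, a completion could be scored with an IoU-based accuracy reward plus a format reward:

```python
# Minimal illustrative sketch of a rule-based reward for a grounding-style task.
# The answer format, regex, and function names are assumptions for this example,
# not the exact reward implementation in the VLM-R1 repository.
import re


def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_reward(completion, gt_box):
    """Score the box inside <answer>[x1, y1, x2, y2]</answer> by its IoU with the ground truth."""
    match = re.search(r"<answer>\s*\[([\d.,\s]+)\]\s*</answer>", completion)
    if not match:
        return 0.0
    values = [v for v in match.group(1).replace(",", " ").split() if v]
    if len(values) != 4:
        return 0.0
    try:
        pred_box = [float(v) for v in values]
    except ValueError:
        return 0.0
    return iou(pred_box, gt_box)


def format_reward(completion):
    """Small bonus if the response follows the <think>...</think><answer>...</answer> template."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), flags=re.DOTALL) else 0.0


# Example usage with a hypothetical completion.
completion = "<think>The dog is on the left.</think> <answer>[0.05, 0.30, 0.45, 0.90]</answer>"
reward = accuracy_reward(completion, gt_box=[0.05, 0.25, 0.50, 0.95]) + format_reward(completion)
```

Because the ground truth is deterministic, the reward can be computed exactly by a rule rather than estimated by a learned reward model, which is what makes the signal precise and stable.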
VLM-R1 Full Technical Report Released!
We dissect how GRPO incentivizes visual reasoning in VLMs, including lots of lessons learned on reward engineering, data sampling, and generalization. Check it out!
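
For readers who have not seen GRPO before, the core mechanism the report builds on is a group-relative advantage: several completions are sampled per prompt, each is scored with a rule-based reward like the one sketched above, and each completion's advantage is its reward standardized within its own group. The snippet below is a minimal illustrative sketch of that computation under those generic assumptions, not the VLM-R1 training loop.

```python
# Minimal illustrative sketch of GRPO's group-relative advantage (not the VLM-R1 training code).
import torch


def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards has shape (num_prompts, group_size): one scalar reward per sampled completion.

    Each completion's advantage is its reward standardized within its own group, so the
    policy update pushes probability toward completions that beat the group average.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)


# One prompt, four sampled completions scored by a rule-based reward like the sketch above.
rewards = torch.tensor([[1.9, 0.2, 0.8, 0.4]])
print(grpo_advantages(rewards))  # the highest-reward completion gets the largest positive advantage
```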