Paper page - VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
\n","updatedAt":"2025-04-15T03:51:11.989Z","author":{"_id":"6313a86154e6e5d9f0f94e04","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1662232951344-6313a86154e6e5d9f0f94e04.jpeg","fullname":"Wenhu Chen","name":"wenhu","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":50,"isUserFollowing":false}},"numEdits":1,"identifiedLanguage":{"language":"en","probability":0.8850191235542297},"editors":["wenhu"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1662232951344-6313a86154e6e5d9f0f94e04.jpeg"],"reactions":[],"isReport":false}},{"id":"6800bd22342c291a262ffed0","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2025-04-17T08:34:42.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning](https://huggingface.co/papers/2503.07365) (2025)\n* [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models](https://huggingface.co/papers/2503.06749) (2025)\n* [OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement](https://huggingface.co/papers/2503.17352) (2025)\n* [Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1](https://huggingface.co/papers/2503.24376) (2025)\n* [Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning](https://huggingface.co/papers/2503.07065) (2025)\n* [SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement](https://huggingface.co/papers/2504.07934) (2025)\n* [Video-R1: Reinforcing Video Reasoning in MLLMs](https://huggingface.co/papers/2503.21776) (2025)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning (2025) - https://huggingface.co/papers/2503.07365
* Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models (2025) - https://huggingface.co/papers/2503.06749
* OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement (2025) - https://huggingface.co/papers/2503.17352
* Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 (2025) - https://huggingface.co/papers/2503.24376
* Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning (2025) - https://huggingface.co/papers/2503.07065
* SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement (2025) - https://huggingface.co/papers/2504.07934
* Video-R1: Reinforcing Video Reasoning in MLLMs (2025) - https://huggingface.co/papers/2503.21776

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
\n","updatedAt":"2025-04-17T08:34:42.394Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7227903008460999},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2504.08837","authors":[{"_id":"67fdc483ba0d61664fb0a19d","user":{"_id":"65bf52f0259bc6caeb74f8bf","avatarUrl":"/avatars/b38392e954466df784a5760ded5df804.svg","isPro":false,"fullname":"Haozhe Wang","user":"JasperHaozhe","type":"user"},"name":"Haozhe Wang","status":"claimed_verified","statusLastChangedAt":"2025-04-15T07:54:20.455Z","hidden":false},{"_id":"67fdc483ba0d61664fb0a19e","name":"Chao Qu","hidden":false},{"_id":"67fdc483ba0d61664fb0a19f","user":{"_id":"6772524ed6f92f429bd343a3","avatarUrl":"/avatars/211e0c4641b2d048b0136d7cdeef2483.svg","isPro":false,"fullname":"Zuming Huang","user":"zuminghuang","type":"user"},"name":"Zuming Huang","status":"admin_assigned","statusLastChangedAt":"2025-04-15T08:17:55.505Z","hidden":false},{"_id":"67fdc483ba0d61664fb0a1a0","name":"Wei Chu","hidden":false},{"_id":"67fdc483ba0d61664fb0a1a1","name":"Fangzhen Lin","hidden":false},{"_id":"67fdc483ba0d61664fb0a1a2","user":{"_id":"6313a86154e6e5d9f0f94e04","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1662232951344-6313a86154e6e5d9f0f94e04.jpeg","isPro":false,"fullname":"Wenhu Chen","user":"wenhu","type":"user"},"name":"Wenhu Chen","status":"extracted_pending","statusLastChangedAt":"2025-04-15T02:29:24.168Z","hidden":false}],"publishedAt":"2025-04-10T17:41:56.000Z","submittedOnDailyAt":"2025-04-15T01:01:12.820Z","title":"VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models\n with Reinforcement Learning","submittedOnDailyBy":{"_id":"6313a86154e6e5d9f0f94e04","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1662232951344-6313a86154e6e5d9f0f94e04.jpeg","isPro":false,"fullname":"Wenhu Chen","user":"wenhu","type":"user"},"summary":"Recently, slow-thinking systems like GPT-o1 and DeepSeek-R1 have demonstrated\ngreat potential in solving challenging problems through explicit reflection.\nThey significantly outperform the best fast-thinking models, such as GPT-4o, on\nvarious math and science benchmarks. However, their multimodal reasoning\ncapabilities remain on par with fast-thinking models. For instance, GPT-o1's\nperformance on benchmarks like MathVista, MathVerse, and MathVision is similar\nto fast-thinking models. In this paper, we aim to enhance the slow-thinking\ncapabilities of vision-language models using reinforcement learning (without\nrelying on distillation) to advance the state of the art. First, we adapt the\nGRPO algorithm with a novel technique called Selective Sample Replay (SSR) to\naddress the vanishing advantages problem. While this approach yields strong\nperformance, the resulting RL-trained models exhibit limited self-reflection or\nself-verification. 
To further encourage slow-thinking, we introduce Forced\nRethinking, which appends a textual rethinking trigger to the end of initial\nrollouts in RL training, explicitly enforcing a self-reflection reasoning step.\nBy combining these two techniques, our model, VL-Rethinker, advances\nstate-of-the-art scores on MathVista, MathVerse, and MathVision to achieve\n80.3%, 61.8%, and 43.9% respectively. VL-Rethinker also achieves open-source\nSoTA on multi-disciplinary benchmarks such as MMMU-Pro, EMMA, and MEGA-Bench,\nnarrowing the gap with GPT-o1.","upvotes":43,"discussionId":"67fdc484ba0d61664fb0a1db","projectPage":"https://tiger-ai-lab.github.io/VL-Rethinker/","githubRepo":"https://github.com/TIGER-AI-Lab/VL-Rethinker","githubRepoAddedBy":"user","ai_summary":"Vision-language models enhanced with reinforcement learning and Forced Rethinking achieve state-of-the-art performance on math and science benchmarks and approach the capabilities of slow-thinking systems.","ai_keywords":["GRPO algorithm","Selective Sample Replay","Forced Rethinking","VL-Rethinker","MathVista","MathVerse","MathVision","MMMU-Pro","EMMA","MEGA-Bench"],"githubStars":182},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6313a86154e6e5d9f0f94e04","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1662232951344-6313a86154e6e5d9f0f94e04.jpeg","isPro":false,"fullname":"Wenhu Chen","user":"wenhu","type":"user"},{"_id":"5ec82854968f6028e0559f70","avatarUrl":"/avatars/45b58d912f7d00cb351947cd79d5eeb4.svg","isPro":true,"fullname":"Xueguang Ma","user":"MrLight","type":"user"},{"_id":"63b908d0e3c78740d8e950d0","avatarUrl":"/avatars/3e80075e92aebdfea712f70b00d5ec7d.svg","isPro":true,"fullname":"Yuxuan Zhang","user":"Reacherx","type":"user"},{"_id":"64f8e358766ff9f3d2b0de84","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64f8e358766ff9f3d2b0de84/R2P1YG-mRBh7TU9wkjGGk.jpeg","isPro":true,"fullname":"Cong Wei","user":"CongWei1230","type":"user"},{"_id":"65358802a920f38780b3248a","avatarUrl":"/avatars/9415510b598079973c2b0436ad12db9c.svg","isPro":false,"fullname":"Ping Nie","user":"pingnieuk","type":"user"},{"_id":"66349404f2c753240d02952a","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/66349404f2c753240d02952a/xKBKicwyk7BoOITQPwBJn.png","isPro":true,"fullname":"ZhuofengLi","user":"ZhuofengLi","type":"user"},{"_id":"65bf52f0259bc6caeb74f8bf","avatarUrl":"/avatars/b38392e954466df784a5760ded5df804.svg","isPro":false,"fullname":"Haozhe Wang","user":"JasperHaozhe","type":"user"},{"_id":"6772524ed6f92f429bd343a3","avatarUrl":"/avatars/211e0c4641b2d048b0136d7cdeef2483.svg","isPro":false,"fullname":"Zuming Huang","user":"zuminghuang","type":"user"},{"_id":"6455fd90bfdf9c63ce2d32a9","avatarUrl":"/avatars/119255bddd172b862ac78333727a1842.svg","isPro":false,"fullname":"Xiaomeng Zhu","user":"Zhuxmmm","type":"user"},{"_id":"6039478ab3ecf716b1a5fd4d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6039478ab3ecf716b1a5fd4d/_Thy4E7taiSYBLKxEKJbT.jpeg","isPro":true,"fullname":"taesiri","user":"taesiri","type":"user"},{"_id":"62567c86d444a9b5a0ec51c1","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62567c86d444a9b5a0ec51c1/1vXJf2uGztPcXpkwyTBr6.png","isPro":false,"fullname":"Dongfu 
Jiang","user":"DongfuJiang","type":"user"},{"_id":"65425316109d784271a4eaf9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65425316109d784271a4eaf9/gXIvvOGcZ7hC9P_AkPm4Z.jpeg","isPro":false,"fullname":"Fisher_SHAO","user":"FisherSHAO","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
AI-generated summary
Vision-language models enhanced with reinforcement learning and Forced Rethinking achieve state-of-the-art performance on math and science benchmarks and approach the capabilities of slow-thinking systems.
Recently, slow-thinking systems like GPT-o1 and DeepSeek-R1 have demonstrated
great potential in solving challenging problems through explicit reflection.
They significantly outperform the best fast-thinking models, such as GPT-4o, on
various math and science benchmarks. However, their multimodal reasoning
capabilities remain on par with fast-thinking models. For instance, GPT-o1's
performance on benchmarks like MathVista, MathVerse, and MathVision is similar
to fast-thinking models. In this paper, we aim to enhance the slow-thinking
capabilities of vision-language models using reinforcement learning (without
relying on distillation) to advance the state of the art. First, we adapt the
GRPO algorithm with a novel technique called Selective Sample Replay (SSR) to
address the vanishing advantages problem. While this approach yields strong
performance, the resulting RL-trained models exhibit limited self-reflection or
self-verification. To further encourage slow-thinking, we introduce Forced
Rethinking, which appends a textual rethinking trigger to the end of initial
rollouts in RL training, explicitly enforcing a self-reflection reasoning step.
By combining these two techniques, our model, VL-Rethinker, advances
state-of-the-art scores on MathVista, MathVerse, and MathVision to achieve
80.3%, 61.8%, and 43.9% respectively. VL-Rethinker also achieves open-source
SoTA on multi-disciplinary benchmarks such as MMMU-Pro, EMMA, and MEGA-Bench,
narrowing the gap with GPT-o1.
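
The abstract names two training-side techniques. For Selective Sample Replay (SSR), the following is a minimal Python sketch of how the vanishing-advantages problem arises in GRPO (a group whose rollouts all receive the same reward yields zero advantages and no gradient) and how a buffer of non-zero-advantage rollouts can be replayed to keep the update signal alive. The buffer capacity, the |advantage|-weighted resampling rule, and all function names here are illustrative assumptions, not the paper's exact implementation.

```python
import random
from collections import deque

import numpy as np


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages as in GRPO: (r - mean) / (std + eps) over the
    rollouts sampled for one prompt. If every rollout gets the same reward
    (all correct or all wrong), every advantage is ~0 and the group contributes
    no policy gradient: the "vanishing advantages" problem."""
    r = np.asarray(rewards, dtype=np.float32)
    return (r - r.mean()) / (r.std() + eps)


class SelectiveSampleReplay:
    """Illustrative SSR buffer: cache rollouts with non-zero advantage and
    replay them when fresh batches are dominated by zero-advantage groups."""

    def __init__(self, capacity=4096):
        self.buffer = deque(maxlen=capacity)

    def add_group(self, prompt, rollouts, rewards):
        advantages = grpo_advantages(rewards)
        for rollout, adv in zip(rollouts, advantages):
            if abs(adv) > 1e-6:  # keep only samples that carry learning signal
                self.buffer.append((prompt, rollout, float(adv)))

    def replay(self, k):
        """Resample up to k cached rollouts, weighted by |advantage| (an assumption)."""
        if not self.buffer:
            return []
        weights = [abs(adv) for _, _, adv in self.buffer]
        return random.choices(list(self.buffer), weights=weights,
                              k=min(k, len(self.buffer)))
```

In this reading, each GRPO step adds its fresh non-zero-advantage rollouts via `add_group` and calls `replay` to pad the batch whenever too many groups collapse to zero advantage.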
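For Forced Rethinking, a similarly hedged sketch: a textual rethinking trigger is appended to the end of an initial rollout and the model continues generating, so each RL trajectory explicitly contains a self-reflection step. The trigger wording and the `model.generate(prompt, image, ...)` interface below are placeholders, not the paper's exact prompts or API.

```python
# Hedged sketch of Forced Rethinking during RL rollout collection.

RETHINK_TRIGGER = "\nWait, let me re-examine the image and my reasoning before I finalize the answer."


def forced_rethinking_rollout(model, prompt, image, max_new_tokens=1024):
    # Stage 1: sample an ordinary (fast-thinking) rollout.
    first_pass = model.generate(prompt, image, max_new_tokens=max_new_tokens)

    # Stage 2: append the textual rethinking trigger and let the model continue,
    # which enforces an explicit self-reflection / self-verification step.
    second_pass = model.generate(prompt + first_pass + RETHINK_TRIGGER, image,
                                 max_new_tokens=max_new_tokens)

    # The concatenated trajectory is what the verifiable RL reward is computed on,
    # so a reflection that corrects a wrong first answer is rewarded.
    return first_pass + RETHINK_TRIGGER + second_pass
```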