Paper page - VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
arxiv:2504.08837

VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Published on Apr 10, 2025 · Submitted by Wenhu Chen on Apr 15, 2025
Authors: Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, Wenhu Chen

Abstract

AI-generated summary

Vision-language models enhanced with reinforcement learning and Forced Rethinking achieve state-of-the-art performance on math and science benchmarks and approach the capabilities of slow-thinking systems.

Recently, slow-thinking systems like GPT-o1 and DeepSeek-R1 have demonstrated great potential in solving challenging problems through explicit reflection. They significantly outperform the best fast-thinking models, such as GPT-4o, on various math and science benchmarks. However, their multimodal reasoning capabilities remain on par with fast-thinking models. For instance, GPT-o1's performance on benchmarks like MathVista, MathVerse, and MathVision is similar to fast-thinking models. In this paper, we aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning (without relying on distillation) to advance the state of the art. First, we adapt the GRPO algorithm with a novel technique called Selective Sample Replay (SSR) to address the vanishing advantages problem. While this approach yields strong performance, the resulting RL-trained models exhibit limited self-reflection or self-verification. To further encourage slow-thinking, we introduce Forced Rethinking, which appends a textual rethinking trigger to the end of initial rollouts in RL training, explicitly enforcing a self-reflection reasoning step. By combining these two techniques, our model, VL-Rethinker, advances state-of-the-art scores on MathVista, MathVerse, and MathVision to achieve 80.3%, 61.8%, and 43.9% respectively. VL-Rethinker also achieves open-source SoTA on multi-disciplinary benchmarks such as MMMU-Pro, EMMA, and MEGA-Bench, narrowing the gap with GPT-o1.
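The abstract names two training-time mechanisms: Selective Sample Replay (SSR), which keeps GRPO updates informative when group advantages vanish, and Forced Rethinking, which appends a textual rethinking trigger to initial rollouts to enforce a self-reflection step. The sketch below is a minimal illustration of how these ideas could be wired together; it is not the authors' implementation. The `policy.generate` / `reward_fn` interfaces, the trigger text, and the replay policy are assumptions made for illustration only.

```python
# Minimal, illustrative sketch of Forced Rethinking and Selective Sample Replay (SSR).
# Assumed (hypothetical) interfaces: policy.generate(prompt) -> text completion,
# reward_fn(prompt, rollout) -> scalar reward. Not the paper's released code.

import random

RETHINK_TRIGGER = "\nWait, let me re-examine the image and my reasoning."  # hypothetical trigger text


def forced_rethinking_rollout(policy, prompt, reward_fn):
    """Append a textual rethinking trigger to an initial rollout and let the model
    continue, so the trajectory contains an explicit self-reflection step."""
    first_pass = policy.generate(prompt)
    continuation = policy.generate(prompt + first_pass + RETHINK_TRIGGER)
    rollout = first_pass + RETHINK_TRIGGER + continuation
    return rollout, reward_fn(prompt, rollout)


def selective_sample_replay(group_rollouts, replay_buffer, batch_size):
    """Keep only rollouts whose group-relative advantage is non-zero (groups with
    identical rewards yield zero advantage and hence no gradient signal under GRPO),
    and replay them when forming the training batch."""
    rewards = [reward for _, reward in group_rollouts]
    mean_reward = sum(rewards) / len(rewards)
    informative = [(rollout, reward - mean_reward)
                   for rollout, reward in group_rollouts
                   if reward != mean_reward]
    replay_buffer.extend(informative)
    k = min(batch_size, len(replay_buffer))
    return random.sample(replay_buffer, k)
```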

Community

Paper author · Paper submitter
edited Apr 15, 2025

Building an open-source VLM that can beat GPT-o1!

[Image: overview-2.jpg]

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

- MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning (2025) — https://huggingface.co/papers/2503.07365
- Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models (2025) — https://huggingface.co/papers/2503.06749
- OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement (2025) — https://huggingface.co/papers/2503.17352
- Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 (2025) — https://huggingface.co/papers/2503.24376
- Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning (2025) — https://huggingface.co/papers/2503.07065
- SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement (2025) — https://huggingface.co/papers/2504.07934
- Video-R1: Reinforcing Video Reasoning in MLLMs (2025) — https://huggingface.co/papers/2503.21776

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space.

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 4

Datasets citing this paper 1

Spaces citing this paper 2

Collections including this paper 15