Paper page - R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model
\n","updatedAt":"2025-03-10T03:39:12.381Z","author":{"_id":"60f1abe7544c2adfd699860c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674929746905-60f1abe7544c2adfd699860c.jpeg","fullname":"AK","name":"akhaliq","type":"user","isPro":false,"isHf":true,"isHfAdmin":false,"isMod":false,"followerCount":9177,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.3173428773880005},"editors":["akhaliq"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674929746905-60f1abe7544c2adfd699860c.jpeg"],"reactions":[],"isReport":false}},{"id":"67cf93423c6156a8eb75c6f2","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2025-03-11T01:34:58.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [Med-RLVR: Emerging Medical Reasoning from a 3B base model via reinforcement Learning](https://huggingface.co/papers/2502.19655) (2025)\n* [DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning](https://huggingface.co/papers/2501.12948) (2025)\n* [Learning from Failures in Multi-Attempt Reinforcement Learning](https://huggingface.co/papers/2503.04808) (2025)\n* [R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning](https://huggingface.co/papers/2503.05592) (2025)\n* [AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO](https://huggingface.co/papers/2502.14669) (2025)\n* [Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling](https://huggingface.co/papers/2501.11651) (2025)\n* [Visual-RFT: Visual Reinforcement Fine-Tuning](https://huggingface.co/papers/2503.01785) (2025)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
\n
The following papers were recommended by the Semantic Scholar API
Please give a thumbs up to this comment if you found it helpful!
\n
If you want recommendations for any Paper on Hugging Face checkout this Space
\n
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend
\n","updatedAt":"2025-03-11T01:34:58.737Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7381060123443604},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2503.05132","authors":[{"_id":"67ce5ec17c6e6ea1cc5649c2","user":{"_id":"6565ebeda0623adbd76642f3","avatarUrl":"/avatars/5b11f4aabd82ce543ad8db0fe016a0f9.svg","isPro":true,"fullname":"Hengguang Zhou","user":"Dolphin42","type":"user"},"name":"Hengguang Zhou","status":"admin_assigned","statusLastChangedAt":"2025-03-10T10:25:19.828Z","hidden":false},{"_id":"67ce5ec17c6e6ea1cc5649c3","user":{"_id":"6534a434e778506c5b1e5be8","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6534a434e778506c5b1e5be8/349SdAnjEdIQJSzWvKfZ4.png","isPro":true,"fullname":"Xirui Li","user":"AIcell","type":"user"},"name":"Xirui Li","status":"admin_assigned","statusLastChangedAt":"2025-03-10T10:25:01.867Z","hidden":false},{"_id":"67ce5ec17c6e6ea1cc5649c4","user":{"_id":"6376e832fe88a92b3a8cacfa","avatarUrl":"/avatars/1ae96b5c5d8b78a08173172dd5569d07.svg","isPro":false,"fullname":"Ruochen Wang","user":"ruocwang","type":"user"},"name":"Ruochen Wang","status":"claimed_verified","statusLastChangedAt":"2025-03-11T08:23:14.306Z","hidden":false},{"_id":"67ce5ec17c6e6ea1cc5649c5","user":{"_id":"67cf1d6310dd1d670adf74ce","avatarUrl":"/avatars/15b3bef32acdc37093e2fc0b32a7b827.svg","isPro":false,"fullname":"Minhao Cheng","user":"cmhcbb","type":"user"},"name":"Minhao Cheng","status":"claimed_verified","statusLastChangedAt":"2025-03-12T08:43:35.575Z","hidden":false},{"_id":"67ce5ec17c6e6ea1cc5649c6","user":{"_id":"647f5af5b0e96764589f3b2a","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/VJ4cDyjp5M3V5WmI5gPIU.jpeg","isPro":false,"fullname":"Tianyi Zhou","user":"zhoutianyi","type":"user"},"name":"Tianyi Zhou","status":"admin_assigned","statusLastChangedAt":"2025-03-10T10:26:09.708Z","hidden":false},{"_id":"67ce5ec17c6e6ea1cc5649c7","name":"Cho-Jui Hsieh","hidden":false}],"publishedAt":"2025-03-07T04:21:47.000Z","submittedOnDailyAt":"2025-03-10T02:09:12.374Z","title":"R1-Zero's \"Aha Moment\" in Visual Reasoning on a 2B Non-SFT Model","submittedOnDailyBy":{"_id":"60f1abe7544c2adfd699860c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674929746905-60f1abe7544c2adfd699860c.jpeg","isPro":false,"fullname":"AK","user":"akhaliq","type":"user"},"summary":"Recently DeepSeek R1 demonstrated how reinforcement learning with simple\nrule-based incentives can enable autonomous development of complex reasoning in\nlarge language models, characterized by the \"aha moment\", in which the model\nmanifest self-reflection and increased response length during training.\nHowever, attempts to extend this success to multimodal reasoning often failed\nto reproduce these key characteristics. In this report, we present the first\nsuccessful replication of these emergent characteristics for multimodal\nreasoning on only a non-SFT 2B model. 
Starting with Qwen2-VL-2B and applying\nreinforcement learning directly on the SAT dataset, our model achieves 59.47%\naccuracy on CVBench, outperforming the base model by approximately ~30% and\nexceeding both SFT setting by ~2%. In addition, we share our failed attempts\nand insights in attempting to achieve R1-like reasoning using RL with instruct\nmodels. aiming to shed light on the challenges involved. Our key observations\ninclude: (1) applying RL on instruct model often results in trivial reasoning\ntrajectories, and (2) naive length reward are ineffective in eliciting\nreasoning capabilities. The project code is available at\nhttps://github.com/turningpoint-ai/VisualThinker-R1-Zero","upvotes":57,"discussionId":"67ce5ec27c6e6ea1cc564a01","projectPage":"https://turningpointai.notion.site/the-multimodal-aha-moment-on-2b-model","githubRepo":"https://github.com/turningpoint-ai/VisualThinker-R1-Zero","githubRepoAddedBy":"user","ai_summary":"A non-SFT model replicated emergent reasoning characteristics for multimodal tasks using reinforcement learning, achieving higher accuracy than base and SFT models on CVBench.","ai_keywords":["reinforcement learning","large language models","multimodal reasoning","Qwen2-VL-2B","SAT dataset","CVBench","instruct models","trivial reasoning trajectories","naive length reward"],"githubStars":623},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"66f612b934b8ac9ffa44f084","avatarUrl":"/avatars/6836c122e19c66c90f1673f28b30d7f0.svg","isPro":false,"fullname":"Tang","user":"tommysally","type":"user"},{"_id":"648eb1eb59c4e5c87dc116e0","avatarUrl":"/avatars/c636cea39c2c0937f01398c94ead5dad.svg","isPro":false,"fullname":"fdsqefsgergd","user":"T-representer","type":"user"},{"_id":"647f5af5b0e96764589f3b2a","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/VJ4cDyjp5M3V5WmI5gPIU.jpeg","isPro":false,"fullname":"Tianyi Zhou","user":"zhoutianyi","type":"user"},{"_id":"6565ebeda0623adbd76642f3","avatarUrl":"/avatars/5b11f4aabd82ce543ad8db0fe016a0f9.svg","isPro":true,"fullname":"Hengguang Zhou","user":"Dolphin42","type":"user"},{"_id":"6534a434e778506c5b1e5be8","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6534a434e778506c5b1e5be8/349SdAnjEdIQJSzWvKfZ4.png","isPro":true,"fullname":"Xirui Li","user":"AIcell","type":"user"},{"_id":"67cf1ba042838a7d3a8f0fa2","avatarUrl":"/avatars/82601970c78e434f0e03295b66f9cbc0.svg","isPro":false,"fullname":"Jiaming Jin","user":"Davidjjm","type":"user"},{"_id":"6270324ebecab9e2dcf245de","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6270324ebecab9e2dcf245de/cMbtWSasyNlYc9hvsEEzt.jpeg","isPro":false,"fullname":"Kye Gomez","user":"kye","type":"user"},{"_id":"67cf1d6310dd1d670adf74ce","avatarUrl":"/avatars/15b3bef32acdc37093e2fc0b32a7b827.svg","isPro":false,"fullname":"Minhao Cheng","user":"cmhcbb","type":"user"},{"_id":"654137252274785dbbbf8e08","avatarUrl":"/avatars/208fcafcb895a7bd7eaf8d436c6bcd7e.svg","isPro":false,"fullname":"Yuanhao Ban","user":"banyh2000","type":"user"},{"_id":"6227a1133988a4f0e9309089","avatarUrl":"/avatars/e10570cfda8c2d6dbecd2a51eedf8799.svg","isPro":false,"fullname":"Andrew Bai","user":"andrewbai","type":"user"},{"_id":"64f0f0fb193de80eda38c1ea","avatarUrl":"/avatars/23afd41a1d1a0555150dc5692f5a5cdd.svg","isPro":false,"fullname":"Tong 
Xie","user":"txie","type":"user"},{"_id":"65862671e878be571bf9fc52","avatarUrl":"/avatars/b2a1b939f3112b476e7641e0c5fd2dc7.svg","isPro":false,"fullname":"cuijiaxing","user":"cuijiaxing","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
AI-generated summary

A non-SFT model replicated emergent reasoning characteristics for multimodal tasks using reinforcement learning, achieving higher accuracy than base and SFT models on CVBench.
Abstract

Recently, DeepSeek R1 demonstrated how reinforcement learning with simple rule-based incentives can enable autonomous development of complex reasoning in large language models, characterized by the "aha moment", in which the model manifests self-reflection and increased response length during training. However, attempts to extend this success to multimodal reasoning often failed to reproduce these key characteristics. In this report, we present the first successful replication of these emergent characteristics for multimodal reasoning on only a non-SFT 2B model. Starting with Qwen2-VL-2B and applying reinforcement learning directly on the SAT dataset, our model achieves 59.47% accuracy on CVBench, outperforming the base model by approximately 30% and exceeding both SFT settings by approximately 2%. In addition, we share our failed attempts and insights in attempting to achieve R1-like reasoning using RL with instruct models, aiming to shed light on the challenges involved. Our key observations include: (1) applying RL to an instruct model often results in trivial reasoning trajectories, and (2) naive length rewards are ineffective in eliciting reasoning capabilities. The project code is available at https://github.com/turningpoint-ai/VisualThinker-R1-Zero
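For readers unfamiliar with the training recipe, the sketch below illustrates what a "simple rule-based incentive" of this kind typically looks like: a format check plus an exact-match accuracy check, with no learned reward model. This is a minimal illustration only; the function and tag names are assumptions, not taken from the VisualThinker-R1-Zero code, and the naive length bonus at the end is included solely to show the kind of reward the abstract reports as ineffective.

```python
import re

# Hypothetical rule-based reward in the R1-Zero style: a format reward for
# using <think>...</think><answer>...</answer> tags plus an exact-match
# accuracy reward. No learned reward model is involved.
THINK_ANSWER = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def rule_based_reward(response: str, ground_truth: str) -> float:
    match = THINK_ANSWER.search(response)
    if match is None:
        return 0.0  # response ignored the required format: no reward at all
    reward = 0.5  # format reward for well-formed <think>/<answer> tags
    answer = match.group(1).strip().lower()
    if answer == ground_truth.strip().lower():
        reward += 1.0  # accuracy reward for an exact answer match
    return reward

# A naive length bonus like this is the kind of reward the paper reports as
# ineffective: it encourages longer outputs without eliciting real reasoning.
def naive_length_bonus(response: str, scale: float = 1e-3) -> float:
    return scale * len(response)

if __name__ == "__main__":
    out = "<think>The cube is left of the sphere.</think> <answer>left</answer>"
    print(rule_based_reward(out, "left"))  # 1.5
```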