
🌐 Project page: https://yihe-deng.notion.site/openvlthinker
🤗 Model: https://huggingface.co/ydeng9/OpenVLThinker-7B
💻 GitHub: https://github.com/yihedeng9/OpenVLThinker

\n","updatedAt":"2025-03-24T02:01:07.005Z","author":{"_id":"642f4c789b2484d7d8551a93","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/642f4c789b2484d7d8551a93/0lH4YXcbZa-Xlzj6ESo7F.jpeg","fullname":"Yihe Deng","name":"ydeng9","type":"user","isPro":true,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":14,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.1679164320230484},"editors":["ydeng9"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/642f4c789b2484d7d8551a93/0lH4YXcbZa-Xlzj6ESo7F.jpeg"],"reactions":[],"isReport":false}},{"id":"67e2082067077e2821831f2c","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2025-03-25T01:34:24.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies](https://huggingface.co/papers/2501.17030) (2025)\n* [Demystifying Long Chain-of-Thought Reasoning in LLMs](https://huggingface.co/papers/2502.03373) (2025)\n* [Improving Vision-Language-Action Model with Online Reinforcement Learning](https://huggingface.co/papers/2501.16664) (2025)\n* [Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning](https://huggingface.co/papers/2502.14768) (2025)\n* [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models](https://huggingface.co/papers/2503.06749) (2025)\n* [R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization](https://huggingface.co/papers/2503.10615) (2025)\n* [OThink-MR1: Stimulating multimodal generalized reasoning capabilities through dynamic reinforcement learning](https://huggingface.co/papers/2503.16081) (2025)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

\n

The following papers were recommended by the Semantic Scholar API

\n\n

Please give a thumbs up to this comment if you found it helpful!

\n

If you want recommendations for any Paper on Hugging Face checkout this Space

\n

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend

\n","updatedAt":"2025-03-25T01:34:24.176Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7319114804267883},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2503.17352","authors":[{"_id":"67e0bcc9e5fa0da84e121032","user":{"_id":"642f4c789b2484d7d8551a93","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/642f4c789b2484d7d8551a93/0lH4YXcbZa-Xlzj6ESo7F.jpeg","isPro":true,"fullname":"Yihe Deng","user":"ydeng9","type":"user"},"name":"Yihe Deng","status":"admin_assigned","statusLastChangedAt":"2025-03-24T12:41:41.202Z","hidden":false},{"_id":"67e0bcc9e5fa0da84e121033","user":{"_id":"61c5c25705aa54027c52f7b3","avatarUrl":"/avatars/8a89e040dc331b7a83d9a704c4fc29d2.svg","isPro":false,"fullname":"Hritik Bansal","user":"hbXNov","type":"user"},"name":"Hritik Bansal","status":"admin_assigned","statusLastChangedAt":"2025-03-24T12:41:47.419Z","hidden":false},{"_id":"67e0bcc9e5fa0da84e121034","user":{"_id":"639bd19e445b133a4e94c3ee","avatarUrl":"/avatars/63ab9919b5e67261cce9007192a70deb.svg","isPro":false,"fullname":"Fan Yin","user":"fanyin3639","type":"user"},"name":"Fan Yin","status":"admin_assigned","statusLastChangedAt":"2025-03-24T12:42:02.800Z","hidden":false},{"_id":"67e0bcc9e5fa0da84e121035","name":"Nanyun Peng","hidden":false},{"_id":"67e0bcc9e5fa0da84e121036","user":{"_id":"62fa0ffe0697d224219a0cb7","avatarUrl":"/avatars/f0ef59e1c0cf4ab4fe5cee08d488bd03.svg","isPro":false,"fullname":"Wei Wang","user":"WeiWang","type":"user"},"name":"Wei Wang","status":"admin_assigned","statusLastChangedAt":"2025-03-24T12:42:09.903Z","hidden":false},{"_id":"67e0bcc9e5fa0da84e121037","user":{"_id":"60b7b9d71b90c5d07c23fbd0","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1622653364258-noauth.jpeg","isPro":false,"fullname":"Kai-Wei Chang","user":"kaiweichang","type":"user"},"name":"Kai-Wei Chang","status":"admin_assigned","statusLastChangedAt":"2025-03-24T12:42:16.596Z","hidden":false}],"publishedAt":"2025-03-21T17:52:43.000Z","submittedOnDailyAt":"2025-03-24T00:31:06.884Z","title":"OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning\n via Iterative Self-Improvement","submittedOnDailyBy":{"_id":"642f4c789b2484d7d8551a93","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/642f4c789b2484d7d8551a93/0lH4YXcbZa-Xlzj6ESo7F.jpeg","isPro":true,"fullname":"Yihe Deng","user":"ydeng9","type":"user"},"summary":"Recent advancements demonstrated by DeepSeek-R1 have shown that complex\nreasoning abilities in large language models (LLMs), including sophisticated\nbehaviors such as self-verification and self-correction, can be achieved by RL\nwith verifiable rewards and significantly improves model performance on\nchallenging tasks such as AIME. Motivated by these findings, our study\ninvestigates whether similar reasoning capabilities can be successfully\nintegrated into large vision-language models (LVLMs) and assesses their impact\non challenging multimodal reasoning tasks. 
We consider an approach that\niteratively leverages supervised fine-tuning (SFT) on lightweight training data\nand Reinforcement Learning (RL) to further improve model generalization.\nInitially, reasoning capabilities were distilled from pure-text R1 models by\ngenerating reasoning steps using high-quality captions of the images sourced\nfrom diverse visual datasets. Subsequently, iterative RL training further\nenhance reasoning skills, with each iteration's RL-improved model generating\nrefined SFT datasets for the next round. This iterative process yielded\nOpenVLThinker, a LVLM exhibiting consistently improved reasoning performance on\nchallenging benchmarks such as MathVista, MathVerse, and MathVision,\ndemonstrating the potential of our strategy for robust vision-language\nreasoning. The code, model and data are held at\nhttps://github.com/yihedeng9/OpenVLThinker.","upvotes":24,"discussionId":"67e0bccae5fa0da84e121079","projectPage":"https://yihe-deng.notion.site/openvlthinker","githubRepo":"https://github.com/yihedeng9/OpenVLThinker","githubRepoAddedBy":"user","ai_summary":"Iterative supervised fine-tuning and reinforcement learning improve reasoning capabilities in large vision-language models, enhancing performance on multimodal reasoning tasks like MathVista, MathVerse, and MathVision.","ai_keywords":["DeepSeek-R1","reinforcement learning","verifiable rewards","large language models","large vision-language models","supervised fine-tuning","reasoning steps","high-quality captions","visual datasets","MathVista","MathVerse","MathVision"],"githubStars":129},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"642f4c789b2484d7d8551a93","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/642f4c789b2484d7d8551a93/0lH4YXcbZa-Xlzj6ESo7F.jpeg","isPro":true,"fullname":"Yihe Deng","user":"ydeng9","type":"user"},{"_id":"63c6782e83ce71db8eda40fb","avatarUrl":"/avatars/f22f0722c3dbcd8c273117062a656301.svg","isPro":false,"fullname":"Mohammed Mohammed Ali","user":"MohammedEltoum","type":"user"},{"_id":"6463554dd2044cd1d7c6e0bf","avatarUrl":"/avatars/d7653623117268c545a7063fec69664b.svg","isPro":false,"fullname":"Bingzheng Wei","user":"Bingzheng","type":"user"},{"_id":"67dd114c74d4b56c8063035b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/xhS1RJdONYGpjX6EOlOKV.png","isPro":false,"fullname":"B.K. 
Lee","user":"vaultOfyoLee","type":"user"},{"_id":"64e7bb81b159a6f87be99459","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64e7bb81b159a6f87be99459/obeVMow9SRRNt722T6ZvU.jpeg","isPro":false,"fullname":"Junkai Zhang","user":"JunkaiZ","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"67a9956b2c088919c1fe4e1d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/y2NlvpVrSCE49AdPMJ_YB.png","isPro":false,"fullname":"xn_deeplearning","user":"xndeeplearning","type":"user"},{"_id":"6721b40b2c9459ac872c5eb7","avatarUrl":"/avatars/c4b3d33e22bae8170db5c5fa25273fe7.svg","isPro":false,"fullname":"Data Explorer","user":"qwerty9904","type":"user"},{"_id":"6342796a0875f2c99cfd313b","avatarUrl":"/avatars/98575092404c4197b20c929a6499a015.svg","isPro":false,"fullname":"Yuseung \"Phillip\" Lee","user":"phillipinseoul","type":"user"},{"_id":"6577073fc2bf55b1f6bafb49","avatarUrl":"/avatars/58803398b1a918b7570db17893e65122.svg","isPro":false,"fullname":"Bencheng liao","user":"LegendBC","type":"user"},{"_id":"6270324ebecab9e2dcf245de","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6270324ebecab9e2dcf245de/cMbtWSasyNlYc9hvsEEzt.jpeg","isPro":false,"fullname":"Kye Gomez","user":"kye","type":"user"},{"_id":"648eb1eb59c4e5c87dc116e0","avatarUrl":"/avatars/c636cea39c2c0937f01398c94ead5dad.svg","isPro":false,"fullname":"fdsqefsgergd","user":"T-representer","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
arxiv:2503.17352

OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement

Published on Mar 21, 2025 · Submitted by Yihe Deng on Mar 24, 2025
Authors: Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, Kai-Wei Chang

AI-generated summary

Iterative supervised fine-tuning and reinforcement learning improve reasoning capabilities in large vision-language models, enhancing performance on multimodal reasoning benchmarks such as MathVista, MathVerse, and MathVision.

Abstract

Recent advancements demonstrated by DeepSeek-R1 have shown that complex reasoning abilities in large language models (LLMs), including sophisticated behaviors such as self-verification and self-correction, can be achieved through RL with verifiable rewards, significantly improving model performance on challenging tasks such as AIME. Motivated by these findings, our study investigates whether similar reasoning capabilities can be successfully integrated into large vision-language models (LVLMs) and assesses their impact on challenging multimodal reasoning tasks. We consider an approach that iteratively leverages supervised fine-tuning (SFT) on lightweight training data and reinforcement learning (RL) to further improve model generalization. Initially, reasoning capabilities are distilled from text-only R1 models by generating reasoning steps from high-quality captions of images sourced from diverse visual datasets. Subsequently, iterative RL training further enhances reasoning skills, with each iteration's RL-improved model generating refined SFT datasets for the next round. This iterative process yielded OpenVLThinker, an LVLM exhibiting consistently improved reasoning performance on challenging benchmarks such as MathVista, MathVerse, and MathVision, demonstrating the potential of our strategy for robust vision-language reasoning. The code, model, and data are available at https://github.com/yihedeng9/OpenVLThinker.
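To make the training recipe concrete, here is a minimal Python sketch of the loop the abstract describes. Every helper function in it (distill_reasoning_from_captions, sft, rl_with_verifiable_reward, generate_refined_sft_data) is a hypothetical placeholder standing in for the corresponding stage, not the repository's actual API.

```python
# A minimal sketch of the iterative SFT -> RL self-improvement loop
# described in the abstract. All helper names below are hypothetical
# placeholders, not functions from the OpenVLThinker codebase.

def train_openvlthinker_style(base_lvlm, text_r1_model, image_data, iterations=3):
    """Alternate SFT and RL, regenerating the SFT dataset each round."""
    # Step 0: distill reasoning traces by prompting a text-only R1 model
    # with high-quality image captions, so it can reason about the images.
    sft_data = distill_reasoning_from_captions(text_r1_model, image_data)

    model = base_lvlm
    for _ in range(iterations):
        # Supervised fine-tuning on the lightweight reasoning dataset.
        model = sft(model, sft_data)
        # RL with a verifiable reward, e.g. 1 if the model's final answer
        # matches the ground truth, 0 otherwise.
        model = rl_with_verifiable_reward(model, image_data)
        # The RL-improved model writes refined reasoning traces that become
        # the SFT data for the next iteration.
        sft_data = generate_refined_sft_data(model, image_data)
    return model
```

The key design choice the sketch captures is that each round's RL-improved model, rather than the original teacher, produces the next round's SFT data, which is what makes the process self-improving.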

Community

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies (2025): https://huggingface.co/papers/2501.17030
* Demystifying Long Chain-of-Thought Reasoning in LLMs (2025): https://huggingface.co/papers/2502.03373
* Improving Vision-Language-Action Model with Online Reinforcement Learning (2025): https://huggingface.co/papers/2501.16664
* Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning (2025): https://huggingface.co/papers/2502.14768
* Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models (2025): https://huggingface.co/papers/2503.06749
* R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization (2025): https://huggingface.co/papers/2503.10615
* OThink-MR1: Stimulating multimodal generalized reasoning capabilities through dynamic reinforcement learning (2025): https://huggingface.co/papers/2503.16081

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 2

Datasets citing this paper 0


Spaces citing this paper 0


Collections including this paper 12