Paper page - Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought
\n","updatedAt":"2025-04-09T07:23:11.661Z","author":{"_id":"620f5a1c3f76c50e6458a9b6","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620f5a1c3f76c50e6458a9b6/pXh_f5F0UvufxuUa-eS-v.jpeg","fullname":"Peiyu Wang","name":"OrlandoHugBot","type":"user","isPro":true,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":12,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6146093606948853},"editors":["OrlandoHugBot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/620f5a1c3f76c50e6458a9b6/pXh_f5F0UvufxuUa-eS-v.jpeg"],"reactions":[],"isReport":false}},{"id":"67f720300af2a99ba78e13c4","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2025-04-10T01:34:40.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models](https://huggingface.co/papers/2503.06749) (2025)\n* [R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization](https://huggingface.co/papers/2503.10615) (2025)\n* [Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning](https://huggingface.co/papers/2503.13360) (2025)\n* [Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1](https://huggingface.co/papers/2503.24376) (2025)\n* [VGRP-Bench: Visual Grid Reasoning Puzzle Benchmark for Large Vision-Language Models](https://huggingface.co/papers/2503.23064) (2025)\n* [SDRT: Enhance Vision-Language Models by Self-Distillation with Diverse Reasoning Traces](https://huggingface.co/papers/2503.01754) (2025)\n* [Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning (v1)](https://huggingface.co/papers/2504.03151) (2025)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
\n
The following papers were recommended by the Semantic Scholar API
Please give a thumbs up to this comment if you found it helpful!
\n
If you want recommendations for any Paper on Hugging Face checkout this Space
\n
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend
\n","updatedAt":"2025-04-10T01:34:40.730Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7119080424308777},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2504.05599","authors":[{"_id":"67f61a98af81b0685bf055cf","name":"Yi Peng","hidden":false},{"_id":"67f61a98af81b0685bf055d0","user":{"_id":"620f5a1c3f76c50e6458a9b6","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620f5a1c3f76c50e6458a9b6/pXh_f5F0UvufxuUa-eS-v.jpeg","isPro":true,"fullname":"Peiyu Wang","user":"OrlandoHugBot","type":"user"},"name":"Chris","status":"admin_assigned","statusLastChangedAt":"2025-04-09T19:12:45.482Z","hidden":false},{"_id":"67f61a98af81b0685bf055d1","user":{"_id":"62be9b5aae56e75e4d689e7c","avatarUrl":"/avatars/6772bc09d6eeb4e86b1210481be91720.svg","isPro":false,"fullname":"wangxiaokun","user":"shawn0wang","type":"user"},"name":"Xiaokun Wang","status":"admin_assigned","statusLastChangedAt":"2025-04-09T16:29:59.248Z","hidden":false},{"_id":"67f61a98af81b0685bf055d2","user":{"_id":"66d3ff488da15c5151c372fb","avatarUrl":"/avatars/4e3ed8b675c822e768e17def7604f0d9.svg","isPro":false,"fullname":"Yichen Wei","user":"rockman24","type":"user"},"name":"Yichen Wei","status":"admin_assigned","statusLastChangedAt":"2025-04-09T16:30:08.089Z","hidden":false},{"_id":"67f61a98af81b0685bf055d3","user":{"_id":"653df83feefdcdc9fd8a5b13","avatarUrl":"/avatars/f7f5bc67df7f8bf8b28e9bbd4fa6c397.svg","isPro":false,"fullname":"Jiangbo Pei","user":"jiangbop","type":"user"},"name":"Jiangbo Pei","status":"admin_assigned","statusLastChangedAt":"2025-04-09T16:30:30.139Z","hidden":false},{"_id":"67f61a98af81b0685bf055d4","user":{"_id":"660aab2c878289c5b34f9e97","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/660aab2c878289c5b34f9e97/yxx1-lR8x5o6KaEpZDXQq.jpeg","isPro":false,"fullname":"weijie qiu","user":"qiuwj","type":"user"},"name":"Weijie Qiu","status":"admin_assigned","statusLastChangedAt":"2025-04-09T16:30:37.481Z","hidden":false},{"_id":"67f61a98af81b0685bf055d5","user":{"_id":"67c7fc3de3f9241ddeb3fe18","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/LHUAn-z0yDYdKmQFKeQiw.png","isPro":false,"fullname":"JianAi","user":"AIJian","type":"user"},"name":"Ai Jian","status":"claimed_verified","statusLastChangedAt":"2025-04-10T13:23:26.167Z","hidden":false},{"_id":"67f61a98af81b0685bf055d6","user":{"_id":"653dd16277c2f09452ad37cd","avatarUrl":"/avatars/a95f9527722845a5414d86180c8e945d.svg","isPro":false,"fullname":"Yunzhuo Hao","user":"luckychao","type":"user"},"name":"Yunzhuo Hao","status":"admin_assigned","statusLastChangedAt":"2025-04-09T16:30:44.019Z","hidden":false},{"_id":"67f61a98af81b0685bf055d7","name":"Jiachun Pan","hidden":false},{"_id":"67f61a98af81b0685bf055d8","user":{"_id":"63fdb1aa27abbe6b3ce098f5","avatarUrl":"/avatars/c22e3a77ff84b3b87c16cff2469f6d3d.svg","isPro":false,"fullname":"xietian","user":"sealical","type":"user"},"name":"Tianyidan 
Xie","status":"claimed_verified","statusLastChangedAt":"2025-04-10T06:45:36.870Z","hidden":false},{"_id":"67f61a98af81b0685bf055d9","name":"Li Ge","hidden":false},{"_id":"67f61a98af81b0685bf055da","name":"Rongxian Zhuang","hidden":false},{"_id":"67f61a98af81b0685bf055db","user":{"_id":"6462b241b438438da3c25a5d","avatarUrl":"/avatars/606a67f1be639c9a5e36f293abd5f27a.svg","isPro":false,"fullname":"Xuchen Song","user":"xuchensong","type":"user"},"name":"Xuchen Song","status":"admin_assigned","statusLastChangedAt":"2025-04-09T16:31:17.507Z","hidden":false},{"_id":"67f61a98af81b0685bf055dc","name":"Yang Liu","hidden":false},{"_id":"67f61a98af81b0685bf055dd","name":"Yahui Zhou","hidden":false}],"publishedAt":"2025-04-08T01:19:20.000Z","submittedOnDailyAt":"2025-04-09T05:32:09.323Z","title":"Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought","submittedOnDailyBy":{"_id":"6462b241b438438da3c25a5d","avatarUrl":"/avatars/606a67f1be639c9a5e36f293abd5f27a.svg","isPro":false,"fullname":"Xuchen Song","user":"xuchensong","type":"user"},"summary":"We introduce Skywork R1V, a multimodal reasoning model extending the an\nR1-series Large language models (LLM) to visual modalities via an efficient\nmultimodal transfer method. Leveraging a lightweight visual projector, Skywork\nR1V facilitates seamless multimodal adaptation without necessitating retraining\nof either the foundational language model or the vision encoder. To strengthen\nvisual-text alignment, we propose a hybrid optimization strategy that combines\nIterative Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization\n(GRPO), significantly enhancing cross-modal integration efficiency.\nAdditionally, we introduce an adaptive-length Chain-of-Thought distillation\napproach for reasoning data generation. This approach dynamically optimizes\nreasoning chain lengths, thereby enhancing inference efficiency and preventing\nexcessive reasoning overthinking. Empirical evaluations demonstrate that\nSkywork R1V, with only 38B parameters, delivers competitive performance,\nachieving a score of 69.0 on the MMMU benchmark and 67.5 on MathVista.\nMeanwhile, it maintains robust textual reasoning performance, evidenced by\nimpressive scores of 72.0 on AIME and 94.0 on MATH500. 
The Skywork R1V model\nweights have been publicly released to promote openness and reproducibility.","upvotes":85,"discussionId":"67f61a9daf81b0685bf05731","githubRepo":"https://github.com/SkyworkAI/Skywork-R1V","githubRepoAddedBy":"user","ai_summary":"Skywork R1V extends large language models to multimodal reasoning with efficient transfer, enhanced visual-text alignment, and dynamic reasoning chain optimization, achieving competitive performance in various benchmarks.","ai_keywords":["multimodal reasoning model","R1-series Large language models","multimodal transfer method","lightweight visual projector","Iterative Supervised Fine-Tuning","Group Relative Policy Optimization","adaptive-length Chain-of-Thought distillation","MMMU benchmark","MathVista","AIME","MATH500"],"githubStars":3150,"organization":{"_id":"6522615d9334173c627b0efa","name":"Skywork","fullname":"Skywork","avatar":"https://cdn-uploads.huggingface.co/production/uploads/64535b71bcbd25618f7655da/AvtJ4GuPAyhLxl2-leVt6.jpeg"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6462b241b438438da3c25a5d","avatarUrl":"/avatars/606a67f1be639c9a5e36f293abd5f27a.svg","isPro":false,"fullname":"Xuchen Song","user":"xuchensong","type":"user"},{"_id":"619b03a080ebe7c9091fbf3c","avatarUrl":"/avatars/0b4be841601195cc73d984055ffab565.svg","isPro":false,"fullname":"Hu Dou Dou","user":"hl0737","type":"user"},{"_id":"612cfc6e1f69b222aacf831b","avatarUrl":"/avatars/b6c7d15ebc7b5dd4b56620bfab324c77.svg","isPro":false,"fullname":"lycfight","user":"lycfight","type":"user"},{"_id":"620f5a1c3f76c50e6458a9b6","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620f5a1c3f76c50e6458a9b6/pXh_f5F0UvufxuUa-eS-v.jpeg","isPro":true,"fullname":"Peiyu Wang","user":"OrlandoHugBot","type":"user"},{"_id":"63fdb1aa27abbe6b3ce098f5","avatarUrl":"/avatars/c22e3a77ff84b3b87c16cff2469f6d3d.svg","isPro":false,"fullname":"xietian","user":"sealical","type":"user"},{"_id":"673f0a5bcdad8a9744d17df0","avatarUrl":"/avatars/413b0472c9790395a64aafe9294143bd.svg","isPro":false,"fullname":"Yichen Wei","user":"yichenchenchen","type":"user"},{"_id":"653dd16277c2f09452ad37cd","avatarUrl":"/avatars/a95f9527722845a5414d86180c8e945d.svg","isPro":false,"fullname":"Yunzhuo Hao","user":"luckychao","type":"user"},{"_id":"62be9b5aae56e75e4d689e7c","avatarUrl":"/avatars/6772bc09d6eeb4e86b1210481be91720.svg","isPro":false,"fullname":"wangxiaokun","user":"shawn0wang","type":"user"},{"_id":"67c8145f5999e7df91a2f8b8","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/nAtdMj3n1dV8IkhcPNeUe.png","isPro":false,"fullname":"skyipeng","user":"skyipeng","type":"user"},{"_id":"660aab2c878289c5b34f9e97","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/660aab2c878289c5b34f9e97/yxx1-lR8x5o6KaEpZDXQq.jpeg","isPro":false,"fullname":"weijie qiu","user":"qiuwj","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"67f61e1459f6e8c3698a84a9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/eRTXFseUr13f-aaY3LXgn.png","isPro":false,"fullname":"peng 
bin","user":"pengdott","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":3,"organization":{"_id":"6522615d9334173c627b0efa","name":"Skywork","fullname":"Skywork","avatar":"https://cdn-uploads.huggingface.co/production/uploads/64535b71bcbd25618f7655da/AvtJ4GuPAyhLxl2-leVt6.jpeg"}}">
AI-generated summary

Skywork R1V extends large language models to multimodal reasoning with efficient transfer, enhanced visual-text alignment, and dynamic reasoning chain optimization, achieving competitive performance on various benchmarks.

Abstract
We introduce Skywork R1V, a multimodal reasoning model extending the R1-series large language models (LLMs) to visual modalities via an efficient multimodal transfer method. Leveraging a lightweight visual projector, Skywork R1V facilitates seamless multimodal adaptation without requiring retraining of either the foundational language model or the vision encoder.
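The abstract does not detail the projector's architecture, but projectors of this kind are typically small MLPs that map frozen vision-encoder features into the LLM's embedding space. The following is a minimal PyTorch sketch under that assumption; the class name, dimensions, and layer layout are illustrative, not the paper's released implementation.

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Minimal MLP projector mapping frozen vision-encoder features
    into the LLM embedding space (illustrative dimensions)."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim) from a frozen ViT
        return self.proj(vision_feats)  # -> (batch, num_patches, llm_dim)


# Only the projector's parameters are trained; the vision encoder and
# the LLM backbone stay frozen, which is what makes the transfer cheap.
projector = VisualProjector()
visual_tokens = projector(torch.randn(2, 256, 1024))
print(visual_tokens.shape)  # torch.Size([2, 256, 4096])
```

The projected visual tokens can then be prepended to the text token embeddings, so the frozen LLM consumes images and text through a single sequence.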
To strengthen visual-text alignment, we propose a hybrid optimization strategy that combines Iterative Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), significantly enhancing cross-modal integration efficiency.
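GRPO's core trick is to dispense with a learned value model: several responses are sampled per prompt, scored with a (often rule-based) reward, and each response's advantage is its reward normalized against its group's mean and standard deviation. A minimal sketch of that group-relative advantage computation (function name and reward values are illustrative):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: normalize each sampled response's
    reward against the statistics of its own group, so no separate
    value network is needed.

    rewards: (num_prompts, group_size) scores, one per sampled response.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)


# Two prompts, four sampled responses each, scored 1.0 if correct else 0.0.
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                        [0.0, 0.0, 1.0, 0.0]])
advantages = grpo_advantages(rewards)
# Responses that beat their group's average get positive advantage and
# are reinforced under the clipped policy-gradient objective.
```

In the hybrid scheme described above, iterative SFT rounds supply aligned supervision while GRPO sharpens the policy on verifiable reasoning rewards.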
Additionally, we introduce an adaptive-length Chain-of-Thought distillation approach for reasoning data generation. This approach dynamically optimizes reasoning chain lengths, thereby improving inference efficiency and preventing excessive overthinking.
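The abstract does not spell out the selection mechanism, but one plausible reading is a generate-then-filter scheme for the distillation data: sample several candidate chains per problem and keep only the shortest one that still reaches a correct answer. The sketch below encodes that assumption; `select_adaptive_cot` and `is_correct` are hypothetical names, not the paper's API.

```python
from typing import Callable, List, Optional

def select_adaptive_cot(candidates: List[str],
                        is_correct: Callable[[str], bool]) -> Optional[str]:
    """Hypothetical filter for adaptive-length CoT distillation data:
    among candidate reasoning chains sampled for one problem, keep the
    shortest chain that still reaches the right answer, so distilled
    chains are only as long as the problem actually requires."""
    correct_chains = [c for c in candidates if is_correct(c)]
    if not correct_chains:
        return None  # drop the example rather than distill a wrong chain
    return min(correct_chains, key=len)


# Example: prefer the concise-but-correct chain over the rambling one.
chains = [
    "Step 1: 2+2=4. Answer: 4",
    "Let me reconsider... check again... 2+2=4. Answer: 4",
]
best = select_adaptive_cot(chains, is_correct=lambda c: c.endswith("Answer: 4"))
```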
Empirical evaluations demonstrate that Skywork R1V, with only 38B parameters, delivers competitive performance, achieving a score of 69.0 on the MMMU benchmark and 67.5 on MathVista. Meanwhile, it maintains robust textual reasoning performance, evidenced by strong scores of 72.0 on AIME and 94.0 on MATH500. The Skywork R1V model weights have been publicly released to promote openness and reproducibility.
Skywork R1V: an open-source 38B multimodal reasoning model extending R1-series LLMs to vision via efficient transfer, hybrid SFT+GRPO training, and adaptive CoT distillation. It scores 69.0 on MMMU and 67.5 on MathVista with strong math reasoning, and the model weights are publicly released. #AI #LLM #Multimodal