Paper page - Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning

arxiv:2601.09536

Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning

Published on Jan 14 · Submitted by charlie on Jan 15
Authors: Dongjie Cheng, Yongqi Li, Zhixin Ma, Hongru Cai, Yupeng Hu, Wenjie Wang, Liqiang Nie, Wenjie Li

Abstract

AI-generated summary: A unified generative multimodal reasoning approach enables diverse reasoning skills through intermediate image generation, with a two-stage SFT+RL framework and a text-only bootstrapping variant.

Multimodal Large Language Models (MLLMs) are making significant progress in multimodal reasoning. Early approaches focus on pure text-based reasoning. More recent studies have incorporated multimodal information into the reasoning steps; however, they often follow a single task-specific reasoning pattern, which limits their generalizability across various multimodal tasks. In fact, there are numerous multimodal tasks requiring diverse reasoning skills, such as zooming in on a specific region or marking an object within an image. To address this, we propose unified generative multimodal reasoning, which unifies diverse multimodal reasoning skills by generating intermediate images during the reasoning process. We instantiate this paradigm with Omni-R1, a two-stage SFT+RL framework featuring perception alignment loss and perception reward, thereby enabling functional image generation. Additionally, we introduce Omni-R1-Zero, which eliminates the need for multimodal annotations by bootstrapping step-wise visualizations from text-only reasoning data. Empirical results show that Omni-R1 achieves unified generative reasoning across a wide range of multimodal tasks, and Omni-R1-Zero can match or even surpass Omni-R1 on average, suggesting a promising direction for generative multimodal reasoning.
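The page carries no code, but the core idea above, alternating text thoughts with generated intermediate images (zooming into a region, marking an object) until an answer is produced, can be sketched as a simple loop. The sketch below is a minimal illustration, not the authors' implementation; `UnifiedModelStub`, `next_step`, and `ReasoningTrace` are hypothetical stand-ins for a real unified MLLM.

```python
# Minimal sketch of interleaved generative reasoning (illustrative only).
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ReasoningTrace:
    steps: list = field(default_factory=list)   # interleaved (kind, content) steps
    answer: Optional[str] = None


class UnifiedModelStub:
    """Stand-in for a unified MLLM; replace with a real model."""

    def next_step(self, question, image, trace):
        # A real model would decide whether the next step is a text thought,
        # an intermediate image (zoom, mark, sketch, ...), or the final answer.
        # Here we return a canned three-step sequence for demonstration.
        n = len(trace.steps)
        if n == 0:
            return ("text", "Locate the region relevant to the question.")
        if n == 1:
            return ("image", "<generated zoomed-in crop of that region>")
        return ("answer", "final answer")


def generative_reasoning(model, question, image, max_steps=8):
    """Interleave text and image generation until an answer is produced."""
    trace = ReasoningTrace()
    for _ in range(max_steps):
        kind, content = model.next_step(question, image, trace)
        if kind == "answer":
            trace.answer = content
            break
        trace.steps.append((kind, content))
    return trace


if __name__ == "__main__":
    trace = generative_reasoning(UnifiedModelStub(), "What is written on the sign?", image=None)
    for kind, content in trace.steps:
        print(kind + ":", content)
    print("answer:", trace.answer)
```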

Community

Paper submitter

This paper proposes a unified generative multimodal reasoning paradigm, instantiated as a two-stage SFT+RL framework with a perception alignment loss and a perception reward, and explores bootstrapping step-wise visualizations from text-only reasoning data when multimodal annotations are scarce.
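As a rough illustration of the RL stage mentioned above, the reward could combine an outcome term for answer correctness with a perception term that scores how useful the generated intermediate images were. The additive form, the weight `lambda_perception`, and the `perception_score` input below are assumptions for illustration, not the paper's exact formulation.

```python
# Illustrative composition of an RL reward for generative multimodal reasoning.

def combined_reward(answer_correct: bool,
                    perception_score: float,
                    lambda_perception: float = 0.5) -> float:
    """Outcome reward plus a weighted perception reward.

    perception_score: value in [0, 1], e.g. how well a generated crop or
    marked image matches the region needed to answer (as judged by a verifier).
    """
    outcome = 1.0 if answer_correct else 0.0
    return outcome + lambda_perception * perception_score


# Example: correct answer whose intermediate visualization scored 0.8.
print(combined_reward(True, 0.8))   # 1.4
print(combined_reward(False, 0.8))  # 0.4
```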

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 2

Datasets citing this paper 1

Spaces citing this paper 0

No Space linking this paper


Collections including this paper 3