M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance

Code: https://github.com/alipay/Ant-Multi-Modal-Framework/tree/main/prj/M2_omni

\n","updatedAt":"2025-06-13T08:47:14.527Z","author":{"_id":"6482dd5ec2ec7df31fd5cfd5","avatarUrl":"/avatars/c74fb04b2b4488a9f3f7872b2f91cd7b.svg","fullname":"qingpei.gqp","name":"qingpei","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":8,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.47095662355422974},"editors":["qingpei"],"editorAvatarUrls":["/avatars/c74fb04b2b4488a9f3f7872b2f91cd7b.svg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2502.18778","authors":[{"_id":"67fcc31c7b2711bef052c580","user":{"_id":"6482dd5ec2ec7df31fd5cfd5","avatarUrl":"/avatars/c74fb04b2b4488a9f3f7872b2f91cd7b.svg","isPro":false,"fullname":"qingpei.gqp","user":"qingpei","type":"user"},"name":"Qingpei Guo","status":"claimed_verified","statusLastChangedAt":"2025-10-09T08:00:20.212Z","hidden":false},{"_id":"67fcc31c7b2711bef052c581","name":"Kaiyou Song","hidden":false},{"_id":"67fcc31c7b2711bef052c582","name":"Zipeng Feng","hidden":false},{"_id":"67fcc31c7b2711bef052c583","user":{"_id":"66b0780a87f605ac4dbaca64","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/-B1ovhWJW2dO3n8ug1dY_.png","isPro":false,"fullname":"ZipingMa","user":"ZipingMa","type":"user"},"name":"Ziping Ma","status":"claimed_verified","statusLastChangedAt":"2025-06-16T07:17:56.238Z","hidden":false},{"_id":"67fcc31c7b2711bef052c584","name":"Qinglong Zhang","hidden":false},{"_id":"67fcc31c7b2711bef052c585","name":"Sirui Gao","hidden":false},{"_id":"67fcc31c7b2711bef052c586","name":"Xuzheng Yu","hidden":false},{"_id":"67fcc31c7b2711bef052c587","name":"Yunxiao Sun","hidden":false},{"_id":"67fcc31c7b2711bef052c588","name":"Tai-Wei Chang","hidden":false},{"_id":"67fcc31c7b2711bef052c589","name":"Jingdong Chen","hidden":false},{"_id":"67fcc31c7b2711bef052c58a","name":"Ming Yang","hidden":false},{"_id":"67fcc31c7b2711bef052c58b","name":"Jun Zhou","hidden":false}],"publishedAt":"2025-02-26T03:21:12.000Z","title":"M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with\n Competitive Performance","summary":"We present M2-omni, a cutting-edge, open-source omni-MLLM that achieves\ncompetitive performance to GPT-4o. M2-omni employs a unified multimodal\nsequence modeling framework, which empowers Large Language Models(LLMs) to\nacquire comprehensive cross-modal understanding and generation capabilities.\nSpecifically, M2-omni can process arbitrary combinations of audio, video,\nimage, and text modalities as input, generating multimodal sequences\ninterleaving with audio, image, or text outputs, thereby enabling an advanced\nand interactive real-time experience. The training of such an omni-MLLM is\nchallenged by significant disparities in data quantity and convergence rates\nacross modalities. To address these challenges, we propose a step balance\nstrategy during pre-training to handle the quantity disparities in\nmodality-specific data. Additionally, a dynamically adaptive balance strategy\nis introduced during the instruction tuning stage to synchronize the\nmodality-wise training progress, ensuring optimal convergence. Notably, we\nprioritize preserving strong performance on pure text tasks to maintain the\nrobustness of M2-omni's language understanding capability throughout the\ntraining process. To our best knowledge, M2-omni is currently a very\ncompetitive open-source model to GPT-4o, characterized by its comprehensive\nmodality and task support, as well as its exceptional performance. 
We expect\nM2-omni will advance the development of omni-MLLMs, thus facilitating future\nresearch in this domain.","upvotes":1,"discussionId":"67fcc31f7b2711bef052c648","ai_summary":"M2-omni, an open-source omni-MLLM, supports multiple media types and achieves competitive performance using balanced pre-training and adaptive tuning strategies.","ai_keywords":["omni-MLLM","multimodal sequence modeling","Large Language Models","LLMs","audio","video","image","text","multimodal sequences","step balance strategy","dynamically adaptive balance strategy","pure text tasks"]},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6482dd5ec2ec7df31fd5cfd5","avatarUrl":"/avatars/c74fb04b2b4488a9f3f7872b2f91cd7b.svg","isPro":false,"fullname":"qingpei.gqp","user":"qingpei","type":"user"}],"acceptLanguages":["*"]}">
arxiv:2502.18778


Published on Feb 26, 2025
Authors: Qingpei Guo, Kaiyou Song, Zipeng Feng, Ziping Ma, Qinglong Zhang, Sirui Gao, Xuzheng Yu, Yunxiao Sun, Tai-Wei Chang, Jingdong Chen, Ming Yang, Jun Zhou

AI-generated summary

M2-omni, an open-source omni-MLLM, supports multiple media types and achieves competitive performance using balanced pre-training and adaptive tuning strategies.

Abstract

We present M2-omni, a cutting-edge, open-source omni-MLLM that achieves performance competitive with GPT-4o. M2-omni employs a unified multimodal sequence modeling framework, which empowers Large Language Models (LLMs) to acquire comprehensive cross-modal understanding and generation capabilities. Specifically, M2-omni can process arbitrary combinations of audio, video, image, and text modalities as input and generate multimodal sequences that interleave audio, image, and text outputs, thereby enabling an advanced and interactive real-time experience. Training such an omni-MLLM is challenging due to significant disparities in data quantity and convergence rates across modalities. To address these challenges, we propose a step balance strategy during pre-training to handle the quantity disparities in modality-specific data. Additionally, a dynamically adaptive balance strategy is introduced during the instruction tuning stage to synchronize modality-wise training progress and ensure optimal convergence. Notably, we prioritize preserving strong performance on pure text tasks to maintain the robustness of M2-omni's language understanding capability throughout training. To the best of our knowledge, M2-omni is currently a highly competitive open-source model relative to GPT-4o, characterized by its comprehensive modality and task support as well as its exceptional performance. We expect M2-omni to advance the development of omni-MLLMs, thus facilitating future research in this domain.
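
The abstract describes the step balance and dynamically adaptive balance strategies only at a high level. The sketch below is a hypothetical Python illustration of how such modality balancing might be wired into a training loop; the modality names, the equal step quota, and the loss-improvement reweighting rule are assumptions chosen for illustration, not the authors' implementation (see the repository linked above for the actual code).

import random

# Hypothetical sketch of the two balancing ideas described in the abstract.
# Modality names, the equal step quota, and the reweighting rule are
# illustrative assumptions, not the authors' actual recipe.

MODALITIES = ["text", "image", "video", "audio"]


def step_balanced_schedule(modalities, total_steps):
    """Pre-training sketch: assign each modality a fixed share of optimizer
    steps, so that modalities with little data (e.g. audio) are not drowned
    out by modalities with abundant data (e.g. text)."""
    per_modality = total_steps // len(modalities)
    schedule = [m for m in modalities for _ in range(per_modality)]
    random.shuffle(schedule)  # interleave modalities across the run
    return schedule


def adaptive_sampling_weights(prev_val_loss, curr_val_loss):
    """Instruction-tuning sketch: up-weight modalities whose validation loss
    is improving slowly, so that all modalities converge at a similar pace."""
    weights = {}
    for m in curr_val_loss:
        # Small (or no) improvement -> larger weight for this modality.
        improvement = max(prev_val_loss[m] - curr_val_loss[m], 1e-8)
        weights[m] = 1.0 / improvement
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}  # normalize to sum to 1


if __name__ == "__main__":
    print(step_balanced_schedule(MODALITIES, total_steps=12))
    prev = {"text": 2.0, "image": 2.5, "video": 3.0, "audio": 3.2}
    curr = {"text": 1.6, "image": 2.4, "video": 2.9, "audio": 2.6}
    print(adaptive_sampling_weights(prev, curr))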


Models citing this paper 0

No model linking this paper


Datasets citing this paper 0

No dataset linking this paper


Spaces citing this paper 0

No Space linking this paper


Collections including this paper 0

No Collection including this paper
