M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance
Code: https://github.com/alipay/Ant-Multi-Modal-Framework/tree/main/prj/M2_omni
arXiv: 2502.18778 · Published 2025-02-26
Authors: Qingpei Guo, Kaiyou Song, Zipeng Feng, Ziping Ma, Qinglong Zhang, Sirui Gao, Xuzheng Yu, Yunxiao Sun, Tai-Wei Chang, Jingdong Chen, Ming Yang, Jun Zhou
AI-generated summary
M2-omni, an open-source omni-MLLM, supports multiple media types and achieves competitive performance using balanced pre-training and adaptive tuning strategies.

Abstract
We present M2-omni, a cutting-edge, open-source omni-MLLM that achieves
performance competitive with GPT-4o. M2-omni employs a unified multimodal
sequence modeling framework, which empowers Large Language Models (LLMs) to
acquire comprehensive cross-modal understanding and generation capabilities.
Specifically, M2-omni can process arbitrary combinations of audio, video,
image, and text modalities as input, generating multimodal sequences that
interleave audio, image, or text outputs, thereby enabling an advanced
and interactive real-time experience. The training of such an omni-MLLM is
challenged by significant disparities in data quantity and convergence rates
across modalities. To address these challenges, we propose a step balance
strategy during pre-training to handle the quantity disparities in
modality-specific data. Additionally, a dynamically adaptive balance strategy
is introduced during the instruction tuning stage to synchronize the
modality-wise training progress, ensuring optimal convergence. Notably, we
prioritize preserving strong performance on pure text tasks to maintain the
robustness of M2-omni's language understanding capability throughout the
training process. To the best of our knowledge, M2-omni is currently a highly
competitive open-source alternative to GPT-4o, characterized by its comprehensive
modality and task support as well as its exceptional performance. We expect
M2-omni to advance the development of omni-MLLMs, thereby facilitating future
research in this domain.
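
The page gives no implementation details for either balancing strategy, so the snippet below is only a minimal Python sketch, under assumed interfaces, of the two ideas the abstract names: step-balanced scheduling of pre-training across modalities, and adapting per-modality loss weights to convergence progress during instruction tuning. Every name, signature, and the ratio-based re-weighting rule are hypothetical illustrations, not M2-omni's published method.

```python
import random

# Minimal sketch (not M2-omni's actual algorithm) of the two balancing ideas
# described in the abstract, using made-up interfaces for illustration.

def step_balanced_schedule(dataset_sizes: dict, total_steps: int,
                           batch_size: int = 32):
    """Pre-training idea: give every modality the same number of optimization
    steps regardless of raw data quantity; smaller corpora are revisited more."""
    modalities = list(dataset_sizes)
    steps_each = total_steps // len(modalities)
    # Implied number of passes over each modality's data (illustrative only).
    epochs = {m: steps_each * batch_size / dataset_sizes[m] for m in modalities}
    schedule = [m for m in modalities for _ in range(steps_each)]
    random.shuffle(schedule)  # interleave modalities across training steps
    return schedule, epochs

def adaptive_loss_weights(losses: dict, ref_losses: dict) -> dict:
    """Instruction-tuning idea: up-weight modalities whose loss is still far
    from a reference value, so slower-converging modalities get a larger share
    of the gradient. The ratio-based rule here is a hypothetical stand-in."""
    progress = {m: losses[m] / max(ref_losses[m], 1e-8) for m in losses}
    total = sum(progress.values())
    return {m: len(losses) * p / total for m, p in progress.items()}

if __name__ == "__main__":
    sizes = {"text": 1_000_000, "image": 200_000, "audio": 50_000, "video": 20_000}
    schedule, epochs = step_balanced_schedule(sizes, total_steps=40_000)
    print({m: schedule.count(m) for m in sizes})   # equal step counts per modality
    print(epochs)                                  # unequal passes over the raw data
    print(adaptive_loss_weights(
        losses={"text": 1.8, "image": 2.9, "audio": 3.5},
        ref_losses={"text": 1.7, "image": 2.0, "audio": 2.0},
    ))
```

In a real training loop such weights would presumably be recomputed periodically from held-out losses rather than fixed once, but the sketch only aims to convey the balancing intent described in the abstract.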