VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
https://github.com/VITA-MLLM/VITA
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* [LMFusion: Adapting Pretrained Language Models for Multimodal Generation](https://huggingface.co/papers/2412.15188) (2024)
* [WavChat: A Survey of Spoken Dialogue Models](https://huggingface.co/papers/2411.13577) (2024)
* [V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding](https://huggingface.co/papers/2412.09616) (2024)
* [Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition](https://huggingface.co/papers/2412.09501) (2024)
* [Advancing Speech Language Models by Scaling Supervised Fine-Tuning with Over 60,000 Hours of Synthetic Speech Dialogue Data](https://huggingface.co/papers/2412.01078) (2024)
* [SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training](https://huggingface.co/papers/2412.15649) (2024)
* [Cross-Modal Consistency in Multimodal Large Language Models](https://huggingface.co/papers/2411.09273) (2024)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out [this Space](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers).

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`
\n","updatedAt":"2025-01-07T01:33:08.442Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7114184498786926},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2501.01957","authors":[{"_id":"677b5979a54b76dcaa4991f9","name":"Chaoyou Fu","hidden":false},{"_id":"677b5979a54b76dcaa4991fa","user":{"_id":"64ffd436d522560505a94b8e","avatarUrl":"/avatars/02d4faac40ac203cb5d635cfcb39780c.svg","isPro":false,"fullname":"Haojia Lin","user":"linhaojia13","type":"user"},"name":"Haojia Lin","status":"admin_assigned","statusLastChangedAt":"2025-01-06T08:24:54.539Z","hidden":false},{"_id":"677b5979a54b76dcaa4991fb","user":{"_id":"664eaf0a98e93ef417c3cc42","avatarUrl":"/avatars/67fb44351cac8964410e5b6549817182.svg","isPro":false,"fullname":"Xiong Wang","user":"xiongwang","type":"user"},"name":"Xiong Wang","status":"admin_assigned","statusLastChangedAt":"2025-01-06T08:25:10.743Z","hidden":false},{"_id":"677b5979a54b76dcaa4991fc","user":{"_id":"623d8ca4c29adf5ef6175615","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/623d8ca4c29adf5ef6175615/q7lHao7UPwU1u7YLSP56m.jpeg","isPro":false,"fullname":"Yi-Fan Zhang","user":"yifanzhang114","type":"user"},"name":"Yi-Fan Zhang","status":"admin_assigned","statusLastChangedAt":"2025-01-06T08:25:17.351Z","hidden":false},{"_id":"677b5979a54b76dcaa4991fd","user":{"_id":"6483143902f98c3f05aff915","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6483143902f98c3f05aff915/ZhWFFgrlRsQf4MXiInh5p.jpeg","isPro":true,"fullname":"沈云航 Yunhang Shen","user":"shenyunhang","type":"user"},"name":"Yunhang Shen","status":"admin_assigned","statusLastChangedAt":"2025-01-06T08:25:29.653Z","hidden":false},{"_id":"677b5979a54b76dcaa4991fe","name":"Xiaoyu Liu","hidden":false},{"_id":"677b5979a54b76dcaa4991ff","name":"Yangze Li","hidden":false},{"_id":"677b5979a54b76dcaa499200","name":"Zuwei Long","hidden":false},{"_id":"677b5979a54b76dcaa499201","user":{"_id":"65ff5dcf82708115869da69a","avatarUrl":"/avatars/10edebbc559e9fb8b0e377c82eba66d4.svg","isPro":false,"fullname":"Heting Gao","user":"hertin","type":"user"},"name":"Heting Gao","status":"admin_assigned","statusLastChangedAt":"2025-01-06T08:26:18.115Z","hidden":false},{"_id":"677b5979a54b76dcaa499202","user":{"_id":"63280915eeee4dd858083092","avatarUrl":"/avatars/78347af4af42527d53e88d9969c5c934.svg","isPro":false,"fullname":"Ke Li","user":"tristanli","type":"user"},"name":"Ke Li","status":"claimed_verified","statusLastChangedAt":"2025-01-09T10:07:25.742Z","hidden":false},{"_id":"677b5979a54b76dcaa499203","user":{"_id":"665d85e35491b1e10d0d5221","avatarUrl":"/avatars/e46ea55b197e2bf8038871ae95f59585.svg","isPro":false,"fullname":"Xiawu Zheng","user":"zhengxiawu","type":"user"},"name":"Xiawu Zheng","status":"admin_assigned","statusLastChangedAt":"2025-01-06T08:26:25.079Z","hidden":false},{"_id":"677b5979a54b76dcaa499204","name":"Rongrong 
Ji","hidden":false},{"_id":"677b5979a54b76dcaa499205","user":{"_id":"647401e50da364bd0d002f2a","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/vPuPn7EV092mLBOM2YZXd.png","isPro":false,"fullname":"XING SUN","user":"tedsun","type":"user"},"name":"Xing Sun","status":"admin_assigned","statusLastChangedAt":"2025-01-06T08:30:21.180Z","hidden":false},{"_id":"677b5979a54b76dcaa499206","name":"Caifeng Shan","hidden":false},{"_id":"677b5979a54b76dcaa499207","name":"Ran He","hidden":false}],"publishedAt":"2025-01-03T18:59:52.000Z","submittedOnDailyAt":"2025-01-06T01:48:26.341Z","title":"VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction","submittedOnDailyBy":{"_id":"60f1abe7544c2adfd699860c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674929746905-60f1abe7544c2adfd699860c.jpeg","isPro":false,"fullname":"AK","user":"akhaliq","type":"user"},"summary":"Recent Multimodal Large Language Models (MLLMs) have typically focused on\nintegrating visual and textual modalities, with less emphasis placed on the\nrole of speech in enhancing interaction. However, speech plays a crucial role\nin multimodal dialogue systems, and implementing high-performance in both\nvision and speech tasks remains a significant challenge due to the fundamental\nmodality differences. In this paper, we propose a carefully designed\nmulti-stage training methodology that progressively trains LLM to understand\nboth visual and speech information, ultimately enabling fluent vision and\nspeech interaction. Our approach not only preserves strong vision-language\ncapacity, but also enables efficient speech-to-speech dialogue capabilities\nwithout separate ASR and TTS modules, significantly accelerating multimodal\nend-to-end response speed. 
By comparing our method against state-of-the-art\ncounterparts across benchmarks for image, video, and speech tasks, we\ndemonstrate that our model is equipped with both strong visual and speech\ncapabilities, making near real-time vision and speech interaction.","upvotes":47,"discussionId":"677b597aa54b76dcaa499262","githubRepo":"https://github.com/VITA-MLLM/VITA","githubRepoAddedBy":"auto","ai_summary":"A multi-stage training method enhances multimodal large language models with both visual and speech capabilities, enabling real-time interaction without separate ASR and TTS modules.","ai_keywords":["Multimodal Large Language Models","multi-stage training","speech-to-speech dialogue","vision-language capacity","ASR","TTS"],"githubStars":2487},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"652965773a416e1f2173443b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/652965773a416e1f2173443b/y9MB8YgHzbwCXAc4EI9T3.jpeg","isPro":true,"fullname":"Yuhao Dong","user":"THUdyh","type":"user"},{"_id":"623d8ca4c29adf5ef6175615","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/623d8ca4c29adf5ef6175615/q7lHao7UPwU1u7YLSP56m.jpeg","isPro":false,"fullname":"Yi-Fan Zhang","user":"yifanzhang114","type":"user"},{"_id":"6448baa3e780dbfc89058bc3","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6448baa3e780dbfc89058bc3/vVwmu6aFPa7YTa8qEYexB.jpeg","isPro":false,"fullname":"Haochen Tian","user":"StarBurger","type":"user"},{"_id":"63054f9320668afe24865bba","avatarUrl":"/avatars/75962ffed33d38761bce6c947750e1e4.svg","isPro":false,"fullname":"KW","user":"kevineen","type":"user"},{"_id":"631c386bc73939ffc0716a37","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1662793811119-noauth.jpeg","isPro":false,"fullname":"SeongWan Kim","user":"idgmatrix","type":"user"},{"_id":"64e8625b21540e1da324b795","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64e8625b21540e1da324b795/pviqKnJoIEJ1zB55Ybj0A.jpeg","isPro":false,"fullname":"sergicalsix","user":"sergicalsix","type":"user"},{"_id":"64a3ff25c2bd8d91c652de94","avatarUrl":"/avatars/da1f9ca9cffd0db2380e3062b4d3f631.svg","isPro":false,"fullname":"Ammar","user":"Ammaqr","type":"user"},{"_id":"67630f55f6580a53d2eaeab1","avatarUrl":"/avatars/098d52e2add9050b3a23f90abcc0060e.svg","isPro":false,"fullname":"Salman","user":"Salman745","type":"user"},{"_id":"6697eb4da1c67d388b05deef","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6697eb4da1c67d388b05deef/biaVIbuT63RgvlWDtFI_9.jpeg","isPro":false,"fullname":"Anthonny Olime","user":"Aviv-anthonnyolime","type":"user"},{"_id":"63b2a92e18e5cf2cdd333492","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63b2a92e18e5cf2cdd333492/GxnngJG0u7d0jYTEFOrfe.png","isPro":false,"fullname":"Jaehyun Jun","user":"btjhjeon","type":"user"},{"_id":"675c3858ca32e91d535e2a95","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/675c3858ca32e91d535e2a95/l-sSKpDfx8qt9yziQj5J0.jpeg","isPro":false,"fullname":"Jiawen Chen","user":"ChenJiawen00","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":2}">
AI-generated summary: A multi-stage training method enhances multimodal large language models with both visual and speech capabilities, enabling real-time interaction without separate ASR and TTS modules.

Abstract
Recent Multimodal Large Language Models (MLLMs) have typically focused on
integrating visual and textual modalities, with less emphasis placed on the
role of speech in enhancing interaction. However, speech plays a crucial role
in multimodal dialogue systems, and achieving high performance in both
vision and speech tasks remains a significant challenge due to the fundamental
modality differences. In this paper, we propose a carefully designed
multi-stage training methodology that progressively trains the LLM to understand
both visual and speech information, ultimately enabling fluent vision and
speech interaction. Our approach not only preserves strong vision-language
capacity, but also enables efficient speech-to-speech dialogue capabilities
without separate ASR and TTS modules, significantly accelerating multimodal
end-to-end response speed. By comparing our method against state-of-the-art
counterparts across benchmarks for image, video, and speech tasks, we
demonstrate that our model is equipped with both strong visual and speech
capabilities, enabling near real-time vision and speech interaction.
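The abstract describes a single end-to-end loop in which the language model consumes speech (and vision) inputs directly and produces speech output itself, rather than chaining separate ASR, LLM, and TTS systems. The sketch below only illustrates that idea under stated assumptions: every name and interface in it (`OmniSpeechPipeline`, `audio_encoder`, `llm_generate`, `speech_decoder`) is a hypothetical placeholder, not the authors' implementation or the VITA repository's API.

```python
# Minimal conceptual sketch (not the authors' code): one speech-to-speech turn in
# which the LLM consumes audio/vision embeddings directly and emits discrete speech
# tokens itself, so no separate ASR or TTS model sits in the response path.
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

import numpy as np


@dataclass
class UserTurn:
    audio: np.ndarray                    # user speech waveform (e.g., 16 kHz mono)
    image: Optional[np.ndarray] = None   # optional image or video frame, H x W x 3


class OmniSpeechPipeline:
    """Single-model loop: modality encoders -> LLM -> speech-token decoder."""

    def __init__(
        self,
        audio_encoder: Callable[[np.ndarray], np.ndarray],
        vision_encoder: Callable[[np.ndarray], np.ndarray],
        llm_generate: Callable[..., Sequence[int]],
        speech_decoder: Callable[[Sequence[int]], np.ndarray],
    ) -> None:
        self.audio_encoder = audio_encoder    # waveform -> audio embeddings
        self.vision_encoder = vision_encoder  # image -> visual embeddings
        self.llm_generate = llm_generate      # embeddings -> speech token ids
        self.speech_decoder = speech_decoder  # speech token ids -> waveform

    def respond(self, turn: UserTurn) -> np.ndarray:
        # 1. Project each input modality into the LLM's embedding space.
        audio_emb = self.audio_encoder(turn.audio)
        vision_emb = self.vision_encoder(turn.image) if turn.image is not None else None

        # 2. The LLM generates discrete speech tokens directly; no intermediate
        #    ASR transcript or separate TTS call is produced.
        speech_tokens = self.llm_generate(audio_emb, vision_emb)

        # 3. A lightweight codec/vocoder turns the tokens back into audio.
        return self.speech_decoder(speech_tokens)


if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs end to end without any real models.
    pipeline = OmniSpeechPipeline(
        audio_encoder=lambda wav: wav[:16].reshape(1, -1),
        vision_encoder=lambda img: img.mean(axis=(0, 1), keepdims=True),
        llm_generate=lambda audio_emb, vision_emb: [1, 2, 3, 4],
        speech_decoder=lambda tokens: np.zeros(16000 * len(tokens) // 4),
    )
    reply_wav = pipeline.respond(UserTurn(audio=np.random.randn(16000)))
    print("reply length (samples):", reply_wav.shape[0])
```

The staged training the abstract mentions would sit upstream of this inference path; the sketch covers only the deployed response loop, which is where removing the ASR and TTS hops is what reduces end-to-end latency.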