LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Abstract
LLaVA-NeXT-Interleave addresses multi-image, multi-frame, multi-view, and multi-patch scenarios in LMMs using the M4-Instruct dataset and LLaVA-Interleave Bench to achieve leading results while maintaining single-image performance.
Visual instruction tuning has made considerable strides in enhancing the capabilities of Large Multimodal Models (LMMs). However, existing open LMMs largely focus on single-image tasks; their applications to multi-image scenarios remain less explored. Additionally, prior LMM research tackles different scenarios separately, making it impossible to generalize across scenarios with new emerging capabilities. To this end, we introduce LLaVA-NeXT-Interleave, which simultaneously tackles Multi-image, Multi-frame (video), Multi-view (3D), and Multi-patch (single-image) scenarios in LMMs. To enable these capabilities, we regard the interleaved data format as a general template and compile the M4-Instruct dataset with 1,177.6k samples, spanning 4 primary domains with 14 tasks and 41 datasets. We also curate the LLaVA-Interleave Bench to comprehensively evaluate the multi-image performance of LMMs. Through extensive experiments, LLaVA-NeXT-Interleave achieves leading results on multi-image, video, and 3D benchmarks while maintaining performance on single-image tasks. In addition, our model exhibits several emerging capabilities, e.g., transferring tasks across different settings and modalities. Code is available at https://github.com/LLaVA-VL/LLaVA-NeXT
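As a quick illustration of the interleaved usage the abstract describes, below is a minimal multi-image inference sketch using a Transformers-compatible checkpoint. The model ID `llava-hf/llava-interleave-qwen-7b-hf` and the local image paths are assumptions; substitute whichever checkpoint from the linked collection you actually use.

```python
# Minimal sketch: multi-image (interleaved) inference with a Transformers-compatible
# LLaVA-Interleave checkpoint. The model ID and image paths below are assumptions;
# replace them with a checkpoint from the linked collection and your own images.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-interleave-qwen-7b-hf"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Two images interleaved with text in a single user turn.
images = [Image.open("scene_view_1.jpg"), Image.open("scene_view_2.jpg")]
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "What changed between these two views?"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=images, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```

Video and 3D inputs follow the same pattern: frames or views are passed as a list of images, with one `{"type": "image"}` entry per frame in the conversation.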
Community
Hi @ZrrSkywalker @Haozhangcx, congrats on your work 🔥
It would be great if you could link the model, dataset, and demo to the paper. You can follow the guide here: https://huggingface.co/docs/hub/en/paper-pages#linking-a-paper-to-a-model-dataset-or-space.
I've linked the HF Transformers-compatible models on the Hub; you can find them in this collection: https://huggingface.co/collections/lmms-lab/llava-next-interleave-66763c55c411b340b35873d1
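For anyone else linking artifacts: the Hub associates a repo with a paper whenever the paper's arXiv URL appears in the repo's model card (see the guide above). Below is a minimal sketch of doing this programmatically with `huggingface_hub`; the repo ID is a placeholder, and the edit requires write access to that repo.

```python
# Minimal sketch: append the paper link to a model card so the Hub links the repo
# to the paper page. The repo ID is a placeholder; requires write access and a
# logged-in token (e.g. via `huggingface-cli login`).
from huggingface_hub import ModelCard

repo_id = "your-org/your-llava-next-interleave-checkpoint"  # placeholder
paper_url = "https://arxiv.org/abs/2407.07895"

card = ModelCard.load(repo_id)
if paper_url not in card.text:
    card.text += f"\n\n## Paper\n\n[LLaVA-NeXT-Interleave]({paper_url})\n"
    card.push_to_hub(repo_id)
```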
Models citing this paper: 8
Browse 8 models citing this paper
Datasets citing this paper: 0
No datasets link this paper