mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
\n","updatedAt":"2024-08-13T01:32:31.613Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.733392059803009},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2408.04840","authors":[{"_id":"66b9889c430220d6a344681e","user":{"_id":"63cd1e04ff7cd335f0ddfa66","avatarUrl":"/avatars/8cca4ed96c699f53d4daabff0f6d6b56.svg","isPro":false,"fullname":"Jiabo Ye","user":"Mizukiluke","type":"user"},"name":"Jiabo Ye","status":"admin_assigned","statusLastChangedAt":"2025-08-25T18:02:14.896Z","hidden":false},{"_id":"66b9889c430220d6a344681f","user":{"_id":"645b10e80c73ea27d13f7aca","avatarUrl":"/avatars/95e565306472a15067440b5b43e07a6f.svg","isPro":false,"fullname":"xuhaiyang","user":"xhyandwyy","type":"user"},"name":"Haiyang Xu","status":"admin_assigned","statusLastChangedAt":"2024-08-13T10:03:00.246Z","hidden":false},{"_id":"66b9889c430220d6a3446820","name":"Haowei Liu","hidden":false},{"_id":"66b9889c430220d6a3446821","user":{"_id":"6584f2f941dbedb146fbb902","avatarUrl":"/avatars/9b6e1edecce6c1a2dcf4739be2bfd1b4.svg","isPro":false,"fullname":"AnwenHu","user":"AnwenHu","type":"user"},"name":"Anwen Hu","status":"admin_assigned","statusLastChangedAt":"2024-08-13T10:04:25.274Z","hidden":false},{"_id":"66b9889c430220d6a3446822","user":{"_id":"64771cfdd7cf39f2e9381aa9","avatarUrl":"/avatars/48adf00c3b653df02628f80511639e19.svg","isPro":false,"fullname":"Ming","user":"MingYan123","type":"user"},"name":"Ming Yan","status":"admin_assigned","statusLastChangedAt":"2024-08-13T10:04:40.775Z","hidden":false},{"_id":"66b9889c430220d6a3446823","name":"Qi Qian","hidden":false},{"_id":"66b9889c430220d6a3446824","name":"Ji Zhang","hidden":false},{"_id":"66b9889c430220d6a3446825","user":{"_id":"635b8b6a37c6a2c12e2cce00","avatarUrl":"/avatars/229fb72180529141515d1df797b33709.svg","isPro":false,"fullname":"Fei Huang","user":"hzhwcmhf","type":"user"},"name":"Fei Huang","status":"admin_assigned","statusLastChangedAt":"2024-08-13T10:05:40.317Z","hidden":false},{"_id":"66b9889c430220d6a3446826","user":{"_id":"602f88f5e8149a962412a667","avatarUrl":"/avatars/b78f0e583df8e5d5e3365934fe5f4900.svg","isPro":false,"fullname":"Zhou","user":"Jingren","type":"user"},"name":"Jingren Zhou","status":"admin_assigned","statusLastChangedAt":"2024-08-13T10:05:50.021Z","hidden":false}],"mediaUrls":["https://cdn-uploads.huggingface.co/production/uploads/645b10e80c73ea27d13f7aca/G0YIQQS5_S7h6ZCxa-b0d.jpeg","https://cdn-uploads.huggingface.co/production/uploads/645b10e80c73ea27d13f7aca/fhj3FdxlSBdmhL03wcnSy.jpeg"],"publishedAt":"2024-08-09T03:25:42.000Z","submittedOnDailyAt":"2024-08-12T02:30:42.392Z","title":"mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal\n Large Language Models","submittedOnDailyBy":{"_id":"645b10e80c73ea27d13f7aca","avatarUrl":"/avatars/95e565306472a15067440b5b43e07a6f.svg","isPro":false,"fullname":"xuhaiyang","user":"xhyandwyy","type":"user"},"summary":"Multi-modal Large Language Models (MLLMs) have demonstrated remarkable\ncapabilities in executing instructions for a variety of 
single-image tasks.\nDespite this progress, significant challenges remain in modeling long image\nsequences. In this work, we introduce the versatile multi-modal large language\nmodel, mPLUG-Owl3, which enhances the capability for long image-sequence\nunderstanding in scenarios that incorporate retrieved image-text knowledge,\ninterleaved image-text, and lengthy videos. Specifically, we propose novel\nhyper attention blocks to efficiently integrate vision and language into a\ncommon language-guided semantic space, thereby facilitating the processing of\nextended multi-image scenarios. Extensive experimental results suggest that\nmPLUG-Owl3 achieves state-of-the-art performance among models with a similar\nsize on single-image, multi-image, and video benchmarks. Moreover, we propose a\nchallenging long visual sequence evaluation named Distractor Resistance to\nassess the ability of models to maintain focus amidst distractions. Finally,\nwith the proposed architecture, mPLUG-Owl3 demonstrates outstanding performance\non ultra-long visual sequence inputs. We hope that mPLUG-Owl3 can contribute to\nthe development of more efficient and powerful multimodal large language\nmodels.","upvotes":33,"discussionId":"66b9889d430220d6a344687d","ai_summary":"mPLUG-Owl3, a versatile multi-modal large language model, introduces hyper attention blocks for improved integration of vision and language, achieving state-of-the-art performance on single-image, multi-image, and video tasks, particularly in scenarios with long visual sequences.","ai_keywords":["multi-modal large language models","MLLMs","mPLUG-Owl3","hyper attention blocks","vision and language integration","semantic space","Distractor Resistance","long visual sequence inputs"]},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"645b10e80c73ea27d13f7aca","avatarUrl":"/avatars/95e565306472a15067440b5b43e07a6f.svg","isPro":false,"fullname":"xuhaiyang","user":"xhyandwyy","type":"user"},{"_id":"6660319b253289136b63b219","avatarUrl":"/avatars/19fb26e4d3f8f39487a583b9b887b9a8.svg","isPro":false,"fullname":"XinCheng","user":"ZERO9215","type":"user"},{"_id":"63cd1e04ff7cd335f0ddfa66","avatarUrl":"/avatars/8cca4ed96c699f53d4daabff0f6d6b56.svg","isPro":false,"fullname":"Jiabo Ye","user":"Mizukiluke","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"6433b6784b34368fdbfebce8","avatarUrl":"/avatars/00fc60e2ed57eb84a4a0eff386357b8c.svg","isPro":false,"fullname":"Star Bottle","user":"StarBottle","type":"user"},{"_id":"6039478ab3ecf716b1a5fd4d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6039478ab3ecf716b1a5fd4d/_Thy4E7taiSYBLKxEKJbT.jpeg","isPro":true,"fullname":"taesiri","user":"taesiri","type":"user"},{"_id":"62009887577fcfa0ce82f1b0","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/laLRCNcbJc2bZzf_4MCXR.png","isPro":false,"fullname":"chenyunkuo","user":"yunkchen","type":"user"},{"_id":"61af81009f77f7b669578f95","avatarUrl":"/avatars/fb50773ac49948940eb231834ee6f2fd.svg","isPro":false,"fullname":"rotem israeli","user":"irotem98","type":"user"},{"_id":"635f9fd1ae7144a6674c839b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1667211208219-noauth.jpeg","isPro":false,"fullname":"Marcus 
Gawronsky","user":"marcusinthesky","type":"user"},{"_id":"648eb1eb59c4e5c87dc116e0","avatarUrl":"/avatars/c636cea39c2c0937f01398c94ead5dad.svg","isPro":false,"fullname":"fdsqefsgergd","user":"T-representer","type":"user"},{"_id":"655ac762cb17ec19ef82719b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/655ac762cb17ec19ef82719b/1kDncYrGLYS_2SR8cNdAL.png","isPro":false,"fullname":"Welcome to matlok","user":"matlok","type":"user"},{"_id":"6270324ebecab9e2dcf245de","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6270324ebecab9e2dcf245de/cMbtWSasyNlYc9hvsEEzt.jpeg","isPro":false,"fullname":"Kye Gomez","user":"kye","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":3}">
AI-generated summary
mPLUG-Owl3, a versatile multi-modal large language model, introduces hyper attention blocks for improved integration of vision and language, achieving state-of-the-art performance on single-image, multi-image, and video tasks, particularly in scenarios with long visual sequences.

Abstract
Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions across a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce mPLUG-Owl3, a versatile multi-modal large language model that enhances long image-sequence understanding in scenarios involving retrieved image-text knowledge, interleaved image-text, and lengthy videos. Specifically, we propose novel hyper attention blocks that efficiently integrate vision and language into a common language-guided semantic space, thereby facilitating the processing of extended multi-image scenarios. Extensive experimental results suggest that mPLUG-Owl3 achieves state-of-the-art performance among models of a similar size on single-image, multi-image, and video benchmarks. Moreover, we propose a challenging long-visual-sequence evaluation named Distractor Resistance to assess the ability of models to maintain focus amidst distractions. Finally, with the proposed architecture, mPLUG-Owl3 demonstrates outstanding performance on ultra-long visual sequence inputs. We hope that mPLUG-Owl3 can contribute to the development of more efficient and powerful multi-modal large language models.
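The abstract does not spell out the internals of the hyper attention blocks, but the description (fusing vision and language into a common language-guided semantic space for efficient multi-image processing) suggests combining text self-attention with text-to-vision cross-attention under a learned gate. Below is a minimal PyTorch sketch of that idea only; the class name `HyperAttentionBlock`, the shared-query design, the sigmoid gate, and all dimensions are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch: one way a "hyper attention" block might fuse
# self-attention over text with cross-attention over vision features
# via a learned gate. Names/shapes are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HyperAttentionBlock(nn.Module):
    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # A single query projection shared by both attention paths.
        self.q_proj = nn.Linear(dim, dim)
        # Separate key/value projections for text and vision tokens.
        self.kv_text = nn.Linear(dim, 2 * dim)
        self.kv_vision = nn.Linear(dim, 2 * dim)
        # Per-token gate deciding how much visual context to mix in.
        self.gate = nn.Linear(dim, 1)
        self.out_proj = nn.Linear(dim, dim)

    def _attend(self, q, kv):
        k, v = kv.chunk(2, dim=-1)
        b, n, _ = q.shape
        m = k.shape[1]
        q = q.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, m, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, m, self.num_heads, self.head_dim).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)
        return out.transpose(1, 2).reshape(b, n, -1)

    def forward(self, text_states, vision_states):
        q = self.q_proj(text_states)
        text_out = self._attend(q, self.kv_text(text_states))
        vision_out = self._attend(q, self.kv_vision(vision_states))
        # Gate in [0, 1] controls the contribution of the visual path.
        g = torch.sigmoid(self.gate(text_states))
        return text_states + self.out_proj(text_out + g * vision_out)


if __name__ == "__main__":
    block = HyperAttentionBlock(dim=1024, num_heads=8)
    text = torch.randn(1, 32, 1024)            # 32 text tokens
    vision = torch.randn(1, 4 * 256, 1024)     # e.g. 4 images x 256 patches each
    print(block(text, vision).shape)           # torch.Size([1, 32, 1024])
```

Because the text queries attend to all vision tokens in one pass, appending more images only grows the key/value side of the cross-attention path, which is one plausible reading of how such a design could scale to long interleaved image-text sequences.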