VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
AI-generated summary
VideoLLaMA 2, an enhanced Video Large Language Model, incorporates Spatial-Temporal Convolution and audio cues to achieve competitive performance in multimodal video and audio tasks.

Abstract
In this paper, we present VideoLLaMA 2, a set of Video Large Language
Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio
understanding in video and audio-oriented tasks. Building upon its predecessor,
VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC)
connector, which effectively captures the intricate spatial and temporal
dynamics of video data. Additionally, we integrate an Audio Branch into the
model through joint training, thereby enriching the multimodal understanding
capabilities of the model by seamlessly incorporating audio cues. Comprehensive
evaluations on multiple-choice video question answering (MC-VQA), open-ended
video question answering (OE-VQA), and video captioning (VC) tasks demonstrate
that VideoLLaMA 2 consistently achieves competitive results among open-source
models and even approaches the performance of some proprietary models on several benchmarks.
Furthermore, VideoLLaMA 2 exhibits reasonable improvements in audio-only and
audio-video question answering (AQA & OE-AVQA) benchmarks over existing models.
These advancements underline VideoLLaMA 2's superior performance in multimodal
comprehension, setting a new standard for intelligent video analysis systems.
All models are public to facilitate further research.
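The abstract names the Spatial-Temporal Convolution (STC) connector but does not spell out its architecture here. The sketch below illustrates one plausible shape for such a connector: a 3D convolution that jointly downsamples frame and patch tokens before projecting them into the LLM embedding space. The layer choices, dimensions, downsampling factors, and the MLP projector are illustrative assumptions, not the released implementation.

```python
# A minimal, hypothetical sketch of an STC-style connector, written against
# the abstract's description only. Shapes, kernel sizes, and the final MLP
# projector are assumptions for illustration, not the authors' exact design.
import torch
import torch.nn as nn


class STCConnectorSketch(nn.Module):
    """Downsample per-frame vision tokens jointly over time and space,
    then project them into the LLM embedding space."""

    def __init__(self, vision_dim=1024, llm_dim=4096, downsample=(2, 2, 2)):
        super().__init__()
        # 3D convolution mixes information across frames (T) and the spatial
        # token grid (H, W) while reducing the total token count.
        self.conv3d = nn.Conv3d(
            vision_dim, vision_dim,
            kernel_size=downsample, stride=downsample,
        )
        # Two-layer MLP projector into the LLM hidden size (assumed here).
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x):
        # x: (B, T, H, W, D) patch features from a frame-level vision encoder
        b, t, h, w, d = x.shape
        x = x.permute(0, 4, 1, 2, 3)      # (B, D, T, H, W)
        x = self.conv3d(x)                # (B, D, T', H', W')
        x = x.flatten(2).transpose(1, 2)  # (B, T'*H'*W', D)
        return self.proj(x)               # (B, T'*H'*W', llm_dim)


# Example: 8 frames of a 24x24 patch grid are reduced to 4*12*12 = 576 tokens.
tokens = STCConnectorSketch()(torch.randn(1, 8, 24, 24, 1024))
print(tokens.shape)  # torch.Size([1, 576, 4096])
```

The resulting video tokens would be concatenated with text embeddings (and, in the full model, with features from the jointly trained Audio Branch) before being fed to the language model; those stages are omitted from this sketch.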
Yes, the codebase of VideoLLaMA2 is adapted from LLaVA. We have mentioned this and given credit to LLaVA in several places (e.g., videollama2_arch.py, videollama2_mistral.py, train.py, project page). We will make this clearer in the next version of our technical report.