
https://github.com/IVGSZ/Flash-VStream

\n","updatedAt":"2024-07-08T10:56:59.929Z","author":{"_id":"60f1abe7544c2adfd699860c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674929746905-60f1abe7544c2adfd699860c.jpeg","fullname":"AK","name":"akhaliq","type":"user","isPro":false,"isHf":true,"isHfAdmin":false,"isMod":false,"followerCount":9181,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.5268412828445435},"editors":["akhaliq"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674929746905-60f1abe7544c2adfd699860c.jpeg"],"reactions":[],"isReport":false}},{"id":"683e5d3dff848ec97ee1d59f","author":{"_id":"6680e0875d1d8125077b87eb","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6680e0875d1d8125077b87eb/LsAbo-XFlImWzQncoQFuN.jpeg","fullname":"amihan ir","name":"amihanir","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"isUserFollowing":false},"createdAt":"2025-06-03T02:26:05.000Z","type":"comment","data":{"edited":true,"hidden":true,"hiddenBy":"","hiddenReason":"Spam","latest":{"raw":"This comment has been hidden","html":"This comment has been hidden","updatedAt":"2025-08-21T03:04:37.484Z","author":{"_id":"6680e0875d1d8125077b87eb","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6680e0875d1d8125077b87eb/LsAbo-XFlImWzQncoQFuN.jpeg","fullname":"amihan ir","name":"amihanir","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"isUserFollowing":false}},"numEdits":0,"editors":[],"editorAvatarUrls":[],"reactions":[]}},{"id":"683e5d7620135a6cbbd61c4f","author":{"_id":"6680e0875d1d8125077b87eb","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6680e0875d1d8125077b87eb/LsAbo-XFlImWzQncoQFuN.jpeg","fullname":"amihan ir","name":"amihanir","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"isUserFollowing":false},"createdAt":"2025-06-03T02:27:02.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"👋","html":"

👋

\n","updatedAt":"2025-06-03T02:27:02.369Z","author":{"_id":"6680e0875d1d8125077b87eb","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6680e0875d1d8125077b87eb/LsAbo-XFlImWzQncoQFuN.jpeg","fullname":"amihan ir","name":"amihanir","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"zh","probability":0.360524445772171},"editors":["amihanir"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/6680e0875d1d8125077b87eb/LsAbo-XFlImWzQncoQFuN.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2406.08085","authors":[{"_id":"666aa4f277d119acc29d4ba0","user":{"_id":"65253da6a895f70381f35444","avatarUrl":"/avatars/e5cc8645e949d071887e89af37d1e3a3.svg","isPro":true,"fullname":"Haoji Zhang","user":"zhang9302002","type":"user"},"name":"Haoji Zhang","status":"admin_assigned","statusLastChangedAt":"2024-07-08T21:14:08.479Z","hidden":false},{"_id":"666aa4f277d119acc29d4ba1","user":{"_id":"63e065ae42591dda0b9b74cc","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63e065ae42591dda0b9b74cc/U1zhG8saFcEo8bNewxugP.jpeg","isPro":false,"fullname":"Yiqin Wang","user":"Yiqin","type":"user"},"name":"Yiqin Wang","status":"admin_assigned","statusLastChangedAt":"2024-07-08T21:14:15.995Z","hidden":false},{"_id":"666aa4f277d119acc29d4ba2","name":"Yansong Tang","hidden":false},{"_id":"666aa4f277d119acc29d4ba3","name":"Yong Liu","hidden":false},{"_id":"666aa4f277d119acc29d4ba4","name":"Jiashi Feng","hidden":false},{"_id":"666aa4f277d119acc29d4ba5","user":{"_id":"64686f7172d9180d4ac8b4e4","avatarUrl":"/avatars/db67dd6c4b2b41054ddcce5a18ade6f8.svg","isPro":false,"fullname":"Jifeng Dai","user":"daijifeng","type":"user"},"name":"Jifeng Dai","status":"admin_assigned","statusLastChangedAt":"2024-07-08T21:14:44.589Z","hidden":false},{"_id":"666aa4f277d119acc29d4ba6","name":"Xiaojie Jin","hidden":false}],"publishedAt":"2024-06-12T11:07:55.000Z","submittedOnDailyAt":"2024-07-08T09:26:59.924Z","title":"Flash-VStream: Memory-Based Real-Time Understanding for Long Video\n Streams","submittedOnDailyBy":{"_id":"60f1abe7544c2adfd699860c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674929746905-60f1abe7544c2adfd699860c.jpeg","isPro":false,"fullname":"AK","user":"akhaliq","type":"user"},"summary":"Benefiting from the advancements in large language models and cross-modal\nalignment, existing multi-modal video understanding methods have achieved\nprominent performance in offline scenario. However, online video streams, as\none of the most common media forms in the real world, have seldom received\nattention. Compared to offline videos, the 'dynamic' nature of online video\nstreams poses challenges for the direct application of existing models and\nintroduces new problems, such as the storage of extremely long-term\ninformation, interaction between continuous visual content and 'asynchronous'\nuser questions. Therefore, in this paper we present Flash-VStream, a\nvideo-language model that simulates the memory mechanism of human. Our model is\nable to process extremely long video streams in real-time and respond to user\nqueries simultaneously. Compared to existing models, Flash-VStream achieves\nsignificant reductions in inference latency and VRAM consumption, which is\nintimately related to performing understanding of online streaming video. 
In\naddition, given that existing video understanding benchmarks predominantly\nconcentrate on offline scenario, we propose VStream-QA, a novel question\nanswering benchmark specifically designed for online video streaming\nunderstanding. Comparisons with popular existing methods on the proposed\nbenchmark demonstrate the superiority of our method for such challenging\nsetting. To verify the generalizability of our approach, we further evaluate it\non existing video understanding benchmarks and achieves state-of-the-art\nperformance in offline scenarios as well. All code, models, and datasets are\navailable at the https://invinciblewyq.github.io/vstream-page/","upvotes":17,"discussionId":"666aa4f477d119acc29d4bfc","ai_summary":"Flash-VStream is a video-language model that efficiently processes online video streams in real-time and responds to user queries, achieving superior performance compared to existing methods on both online and offline video understanding benchmarks.","ai_keywords":["large language models","cross-modal alignment","online video streams","dynamic nature","storage","asynchronous user questions","Flash-VStream","memory mechanism","real-time processing","inference latency","VRAM consumption","understanding of online streaming video","VStream-QA","question answering benchmark","offline scenario","video understanding benchmarks","state-of-the-art performance"]},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"667138b2424cf851b00f9901","avatarUrl":"/avatars/026c7f8e0b61655c14afc629559079c3.svg","isPro":false,"fullname":"IVGSZ","user":"IVGSZ","type":"user"},{"_id":"63e065ae42591dda0b9b74cc","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63e065ae42591dda0b9b74cc/U1zhG8saFcEo8bNewxugP.jpeg","isPro":false,"fullname":"Yiqin Wang","user":"Yiqin","type":"user"},{"_id":"643966f3bb7ded0a0fef171b","avatarUrl":"/avatars/f49bf8df8b127c2a308367d10af3a30e.svg","isPro":false,"fullname":"Xiaojie Jin","user":"xjjin","type":"user"},{"_id":"6032802e1f993496bc14d9e3","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6032802e1f993496bc14d9e3/w6hr-DEQot4VVkoyRIBiy.png","isPro":false,"fullname":"Omar Sanseviero","user":"osanseviero","type":"user"},{"_id":"63ddc7b80f6d2d6c3efe3600","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63ddc7b80f6d2d6c3efe3600/RX5q9T80Jl3tn6z03ls0l.jpeg","isPro":false,"fullname":"J","user":"dashfunnydashdash","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"6362ddb7d3be91534c30bfd6","avatarUrl":"/avatars/dac76ebd3b8a08099497ec0b0524bc7c.svg","isPro":false,"fullname":"Art Atk","user":"ArtAtk","type":"user"},{"_id":"6555125a4f361968f0e3aad7","avatarUrl":"/avatars/e7692d82804338f21ecdc6e731f5c5ea.svg","isPro":false,"fullname":"marinaretikof","user":"marinaretik","type":"user"},{"_id":"651c80a26ba9ab9b9582c273","avatarUrl":"/avatars/e963452eafd21f517d800f2e58e0f918.svg","isPro":false,"fullname":"siyeng feng","user":"siyengfeng","type":"user"},{"_id":"65c4063740d617a14238f3df","avatarUrl":"/avatars/726b1470e46ad71c9ec233f3f0f396ec.svg","isPro":false,"fullname":"Zikun 
Li","user":"zikun-li","type":"user"},{"_id":"648c9605565e3a44f3c9bb7b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/648c9605565e3a44f3c9bb7b/W5chvk17Zol6-2QSWkFVR.jpeg","isPro":true,"fullname":"Orr Zohar","user":"orrzohar","type":"user"},{"_id":"668982b13e1066772fba1c8f","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/668982b13e1066772fba1c8f/CqhWo7OVnFL96pXt-GLN3.jpeg","isPro":false,"fullname":"Darrin Mccann","user":"darreen","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
Papers
arxiv:2406.08085

Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams

Published on Jun 12, 2024 · Submitted by AK on Jul 8, 2024
Authors: Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, Xiaojie Jin

Abstract

Benefiting from advances in large language models and cross-modal alignment, existing multi-modal video understanding methods have achieved prominent performance in offline scenarios. However, online video streams, one of the most common media forms in the real world, have seldom received attention. Compared to offline videos, the 'dynamic' nature of online video streams poses challenges for the direct application of existing models and introduces new problems, such as the storage of extremely long-term information and the interaction between continuous visual content and 'asynchronous' user questions. Therefore, in this paper we present Flash-VStream, a video-language model that simulates the human memory mechanism. Our model can process extremely long video streams in real time and respond to user queries simultaneously. Compared to existing models, Flash-VStream achieves significant reductions in inference latency and VRAM consumption, both of which are critical for understanding online streaming video. In addition, given that existing video understanding benchmarks predominantly concentrate on offline scenarios, we propose VStream-QA, a novel question answering benchmark specifically designed for online video stream understanding. Comparisons with popular existing methods on the proposed benchmark demonstrate the superiority of our method in this challenging setting. To verify the generalizability of our approach, we further evaluate it on existing video understanding benchmarks and achieve state-of-the-art performance in offline scenarios as well. All code, models, and datasets are available at https://invinciblewyq.github.io/vstream-page/

AI-generated summary

Flash-VStream is a video-language model that efficiently processes online video streams in real time and responds to user queries, achieving superior performance compared to existing methods on both online and offline video understanding benchmarks.
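
The memory-based streaming design described above can be pictured as two concurrent roles: a frame worker that keeps encoding incoming frames into a fixed-size memory, and a query path that answers 'asynchronous' user questions against whatever the memory currently holds. The sketch below is a minimal illustration of that pattern, not the paper's implementation: the names (StreamingMemory, frame_worker, answer), the nearest-slot consolidation rule, and the mocked encoder are all assumptions made for illustration.

```python
# Minimal sketch of a memory-based streaming loop (illustrative only).
import queue
import threading

import numpy as np

MEMORY_SLOTS = 64   # fixed budget: memory stays constant however long the stream runs
FEATURE_DIM = 256


class StreamingMemory:
    """Fixed-size feature memory, updated as frames arrive."""

    def __init__(self) -> None:
        self.slots = np.zeros((MEMORY_SLOTS, FEATURE_DIM), dtype=np.float32)
        self.count = 0
        self.lock = threading.Lock()

    def write(self, feature: np.ndarray) -> None:
        with self.lock:
            if self.count < MEMORY_SLOTS:
                self.slots[self.count] = feature
                self.count += 1
            else:
                # Consolidation stand-in: merge the new feature into its
                # nearest slot so the memory never grows past MEMORY_SLOTS.
                idx = int(np.argmin(np.linalg.norm(self.slots - feature, axis=1)))
                self.slots[idx] = 0.5 * (self.slots[idx] + feature)

    def read(self) -> np.ndarray:
        with self.lock:
            return self.slots[: max(self.count, 1)].copy()


def frame_worker(frames: "queue.Queue", memory: StreamingMemory) -> None:
    """Continuously encode incoming frames into memory (encoder mocked here)."""
    while True:
        frame = frames.get()
        if frame is None:  # sentinel: stream ended
            break
        feature = np.asarray(frame, dtype=np.float32)  # placeholder visual encoder
        memory.write(feature)


def answer(question: str, memory: StreamingMemory) -> str:
    """Answer an asynchronous question against a snapshot of current memory."""
    context = memory.read()
    return f"{question} -> answered over {len(context)} memory slots"


if __name__ == "__main__":
    mem = StreamingMemory()
    frames: "queue.Queue" = queue.Queue()
    worker = threading.Thread(target=frame_worker, args=(frames, mem))
    worker.start()
    for _ in range(200):  # simulate a long stream
        frames.put(np.random.rand(FEATURE_DIM))
    # A query can arrive at any point while frames are still streaming in.
    print(answer("What happened so far?", mem))
    frames.put(None)
    worker.join()
```

The property this sketch illustrates is the decoupling the abstract emphasizes: answering a query reads only the fixed-size memory, so query latency and memory footprint stay constant no matter how much video has streamed.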

Community


👋


Models citing this paper 2

Datasets citing this paper 1

Spaces citing this paper 2

Collections including this paper 5