Paper page - E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding

Project Page: https://polyu-chenlab.github.io/etbench/
Code: https://github.com/PolyU-ChenLab/ETBench

\n","updatedAt":"2024-10-03T09:46:21.860Z","author":{"_id":"6250eb5c1fd03f78d0ae550f","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1649470297585-noauth.jpeg","fullname":"Ye Liu","name":"yeliudev","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":11,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.5225909948348999},"editors":["yeliudev"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1649470297585-noauth.jpeg"],"reactions":[],"isReport":false}},{"id":"66ff45f8d8ef894b11822bbd","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2024-10-04T01:33:44.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs](https://huggingface.co/papers/2409.20063) (2024)\n* [From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding](https://huggingface.co/papers/2409.18938) (2024)\n* [VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs](https://huggingface.co/papers/2409.20365) (2024)\n* [Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding](https://huggingface.co/papers/2409.19532) (2024)\n* [Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding](https://huggingface.co/papers/2409.14485) (2024)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

\n

The following papers were recommended by the Semantic Scholar API

\n\n

Please give a thumbs up to this comment if you found it helpful!

\n

If you want recommendations for any Paper on Hugging Face checkout this Space

\n

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend

\n","updatedAt":"2024-10-04T01:33:44.273Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7031398415565491},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2409.18111","authors":[{"_id":"66f703ea91c723ca3194dcca","user":{"_id":"6250eb5c1fd03f78d0ae550f","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1649470297585-noauth.jpeg","isPro":false,"fullname":"Ye Liu","user":"yeliudev","type":"user"},"name":"Ye Liu","status":"claimed_verified","statusLastChangedAt":"2024-11-05T07:58:52.585Z","hidden":false},{"_id":"66f703ea91c723ca3194dccb","user":{"_id":"61bb00f6c4ac95d207b25f1b","avatarUrl":"/avatars/3b6eba701d64518d6f694942f5b2e9a9.svg","isPro":false,"fullname":"Zongyang Ma","user":"zyma","type":"user"},"name":"Zongyang Ma","status":"admin_assigned","statusLastChangedAt":"2024-10-09T13:34:04.459Z","hidden":false},{"_id":"66f703ea91c723ca3194dccc","user":{"_id":"660cc64f1b14f691990c0ea0","avatarUrl":"/avatars/f172d3120b22d745a41cc3f2eb499ce6.svg","isPro":false,"fullname":"Zhongang Qi","user":"phoenixqza","type":"user"},"name":"Zhongang Qi","status":"claimed_verified","statusLastChangedAt":"2024-11-19T08:40:07.402Z","hidden":false},{"_id":"66f703ea91c723ca3194dccd","name":"Yang Wu","hidden":false},{"_id":"66f703ea91c723ca3194dcce","name":"Ying Shan","hidden":false},{"_id":"66f703ea91c723ca3194dccf","name":"Chang Wen Chen","hidden":false}],"mediaUrls":["https://cdn-uploads.huggingface.co/production/uploads/6250eb5c1fd03f78d0ae550f/K-EC07EwUix9zIjrRi8-M.jpeg"],"publishedAt":"2024-09-26T17:53:04.000Z","submittedOnDailyAt":"2024-10-03T08:16:21.851Z","title":"E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding","submittedOnDailyBy":{"_id":"6250eb5c1fd03f78d0ae550f","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1649470297585-noauth.jpeg","isPro":false,"fullname":"Ye Liu","user":"yeliudev","type":"user"},"summary":"Recent advances in Video Large Language Models (Video-LLMs) have demonstrated\ntheir great potential in general-purpose video understanding. To verify the\nsignificance of these models, a number of benchmarks have been proposed to\ndiagnose their capabilities in different scenarios. However, existing\nbenchmarks merely evaluate models through video-level question-answering,\nlacking fine-grained event-level assessment and task diversity. To fill this\ngap, we introduce E.T. Bench (Event-Level & Time-Sensitive Video Understanding\nBenchmark), a large-scale and high-quality benchmark for open-ended event-level\nvideo understanding. Categorized within a 3-level task taxonomy, E.T. Bench\nencompasses 7.3K samples under 12 tasks with 7K videos (251.4h total length)\nunder 8 domains, providing comprehensive evaluations. 
We extensively evaluated\n8 Image-LLMs and 12 Video-LLMs on our benchmark, and the results reveal that\nstate-of-the-art models for coarse-level (video-level) understanding struggle\nto solve our fine-grained tasks, e.g., grounding event-of-interests within\nvideos, largely due to the short video context length, improper time\nrepresentations, and lack of multi-event training data. Focusing on these\nissues, we further propose a strong baseline model, E.T. Chat, together with an\ninstruction-tuning dataset E.T. Instruct 164K tailored for fine-grained\nevent-level understanding. Our simple but effective solution demonstrates\nsuperior performance in multiple scenarios.","upvotes":7,"discussionId":"66f703ec91c723ca3194dd41","githubRepo":"https://github.com/PolyU-ChenLab/ETBench","githubRepoAddedBy":"auto","ai_summary":"E.T. Bench is a comprehensive benchmark for event-level video understanding, highlighting challenges in current models and introducing a strong baseline solution.","ai_keywords":["Video Large Language Models","Video-LLMs","Image-LLMs","video-level question-answering","event-level assessment","task diversity","Event-Level & Time-Sensitive Video Understanding Benchmark","E.T. Bench","task taxonomy","grounding event-of-interests","time representations","multi-event training data","E.T. Chat","E.T. Instruct"],"githubStars":74},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"648eb1eb59c4e5c87dc116e0","avatarUrl":"/avatars/c636cea39c2c0937f01398c94ead5dad.svg","isPro":false,"fullname":"fdsqefsgergd","user":"T-representer","type":"user"},{"_id":"63f96e99ade090bc87bc2f81","avatarUrl":"/avatars/0dd0807e5b2cec011e97c8d6a3c61bae.svg","isPro":false,"fullname":"hcwei","user":"hcwei","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"641b754d1911d3be6745cce9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/641b754d1911d3be6745cce9/Ydjcjd4VuNUGj5Cd4QHdB.png","isPro":false,"fullname":"atayloraerospace","user":"Taylor658","type":"user"},{"_id":"6250eb5c1fd03f78d0ae550f","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1649470297585-noauth.jpeg","isPro":false,"fullname":"Ye Liu","user":"yeliudev","type":"user"},{"_id":"647e472f33493e1c433c0051","avatarUrl":"/avatars/efbb3b957c650a5264efd58a1da6fb71.svg","isPro":false,"fullname":"Xiang","user":"Ruxun","type":"user"},{"_id":"61bb00f6c4ac95d207b25f1b","avatarUrl":"/avatars/3b6eba701d64518d6f694942f5b2e9a9.svg","isPro":false,"fullname":"Zongyang Ma","user":"zyma","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
arxiv:2409.18111

E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding

Published on Sep 26, 2024
Submitted by Ye Liu on Oct 3, 2024
Authors: Ye Liu, Zongyang Ma, Zhongang Qi, Yang Wu, Ying Shan, Chang Wen Chen

Abstract

Recent advances in Video Large Language Models (Video-LLMs) have demonstrated their great potential in general-purpose video understanding. To verify the significance of these models, a number of benchmarks have been proposed to diagnose their capabilities in different scenarios. However, existing benchmarks merely evaluate models through video-level question-answering, lacking fine-grained event-level assessment and task diversity. To fill this gap, we introduce E.T. Bench (Event-Level & Time-Sensitive Video Understanding Benchmark), a large-scale and high-quality benchmark for open-ended event-level video understanding. Categorized within a 3-level task taxonomy, E.T. Bench encompasses 7.3K samples across 12 tasks, with 7K videos (251.4 hours in total) spanning 8 domains, providing comprehensive evaluations. We extensively evaluated 8 Image-LLMs and 12 Video-LLMs on our benchmark, and the results reveal that state-of-the-art models for coarse-level (video-level) understanding struggle to solve our fine-grained tasks, e.g., grounding events of interest within videos, largely due to their short video context length, improper time representations, and lack of multi-event training data. Focusing on these issues, we further propose a strong baseline model, E.T. Chat, together with an instruction-tuning dataset, E.T. Instruct 164K, tailored for fine-grained event-level understanding. Our simple but effective solution demonstrates superior performance in multiple scenarios.

AI-generated summary

E.T. Bench is a comprehensive benchmark for event-level video understanding, highlighting challenges in current models and introducing a strong baseline solution.
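To make the event-grounding setting concrete, below is a minimal, illustrative sketch of how localized event predictions are commonly scored against ground-truth time segments using temporal IoU. The function names, threshold, and example numbers are hypothetical and are not taken from the E.T. Bench evaluation code; the actual metrics and matching rules are defined in the official repository (https://github.com/PolyU-ChenLab/ETBench).

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec) of an event within a video


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two time intervals, a standard grounding metric."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds: List[Segment], gts: List[Segment], thresh: float = 0.5) -> float:
    """Fraction of ground-truth events matched by some prediction with IoU >= thresh."""
    hits = sum(1 for gt in gts if any(temporal_iou(p, gt) >= thresh for p in preds))
    return hits / len(gts) if gts else 0.0


if __name__ == "__main__":
    # A model localizes one event; the ground truth annotates two events.
    preds = [(12.0, 18.5)]
    gts = [(11.0, 19.0), (42.0, 50.0)]
    print(recall_at_iou(preds, gts, thresh=0.5))  # 0.5: only the first event is recovered
```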

Community

Paper author and paper submitter Ye Liu shared the project page (https://polyu-chenlab.github.io/etbench/) and the code (https://github.com/PolyU-ChenLab/ETBench).

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs (https://huggingface.co/papers/2409.20063) (2024)
* From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding (https://huggingface.co/papers/2409.18938) (2024)
* VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs (https://huggingface.co/papers/2409.20365) (2024)
* Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding (https://huggingface.co/papers/2409.19532) (2024)
* Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding (https://huggingface.co/papers/2409.14485) (2024)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper: 4

Datasets citing this paper: 2

Spaces citing this paper: 0


Collections including this paper: 2