arxiv:2508.21496

ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding

Published on Aug 29, 2025
· Submitted by luhao on Sep 3, 2025
Authors: Hao Lu, Jiahao Wang, Yaolun Zhang, Ruohui Wang, Xuanyu Zheng, Yepeng Tang, Dahua Lin, Lewei Lu
AI-generated summary

A benchmark for long-video hallucination identifies and investigates Semantic Aggregation Hallucination (SAH), showing its prevalence in complex and rapidly changing semantic contexts, and proposes strategies to mitigate it.

Abstract

Video multimodal large language models (Video-MLLMs) have achieved remarkable progress in video understanding. However, they remain vulnerable to hallucination: producing content inconsistent with or unrelated to the video input. Previous video hallucination benchmarks primarily focus on short videos and attribute hallucinations to factors such as strong language priors, missing frames, or vision-language biases introduced by the visual encoder. While these causes indeed account for most hallucinations in short videos, they still oversimplify the causes of hallucination: sometimes, models generate incorrect outputs despite correct frame-level semantics. We refer to this type of hallucination as Semantic Aggregation Hallucination (SAH), which arises during the process of aggregating frame-level semantics into event-level semantic groups. Given that SAH becomes particularly critical in long videos due to increased semantic complexity across multiple events, it is essential to separate and thoroughly investigate the causes of this type of hallucination. To address these issues, we introduce ELV-Halluc, the first benchmark dedicated to long-video hallucination, enabling a systematic investigation of SAH. Our experiments confirm the existence of SAH and show that it increases with semantic complexity. Additionally, we find that models are more prone to SAH on rapidly changing semantics. Moreover, we discuss potential approaches to mitigate SAH. We demonstrate that the positional encoding strategy contributes to alleviating SAH, and further adopt a DPO strategy to enhance the model's ability to distinguish semantics within and across events. To support this, we curate a dataset of 8K adversarial data pairs and achieve improvements on both ELV-Halluc and Video-MME, including a substantial 27.7% reduction in SAH ratio.
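The DPO training mentioned in the abstract pairs a faithful response with a hallucinated one for the same video. As a rough illustration only (this is the standard DPO objective, not the paper's actual training code, and all function and variable names here are hypothetical), the per-pair loss can be sketched as:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed token log-probability of the chosen
    (faithful) or rejected (hallucinated) response under the trained
    policy or a frozen reference model.
    """
    # Implicit reward margin: how much more the policy prefers the
    # chosen response over the rejected one, relative to the reference.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(logits)): small when the margin is large and positive.
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Policy already favors the faithful answer -> small loss:
low = dpo_loss(-5.0, -20.0, -10.0, -10.0)
# Policy favors the hallucinated answer -> large loss:
high = dpo_loss(-20.0, -5.0, -10.0, -10.0)
```

Minimizing this loss over the 8K adversarial pairs pushes the model to assign higher likelihood to semantics correctly bound to their events than to plausible-but-misattributed ones.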

Community

Paper author · Paper submitter

ELV-Halluc is now available!
We are excited to announce the release of ELV-Halluc along with the DPO data. 🚀

📄 arXiv paper: https://arxiv.org/pdf/2508.21496
💻 GitHub: https://github.com/hlsv02/ELV-Halluc

We hope our work can contribute to advancing research on video large language models.
Feel free to ⭐️ star, fork, and follow our updates!


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2508.21496 in a model README.md to link it from this page.

Datasets citing this paper 2

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2508.21496 in a Space README.md to link it from this page.

Collections including this paper 4