ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding
We hope our work can contribute to advancing research on video large language models. Feel free to ⭐️ star, fork, and follow our updates!
Hao Lu, Jiahao Wang, Yaolun Zhang, Ruohui Wang, Xuanyu Zheng, Yepeng Tang, Dahua Lin, Lewei Lu
Published: 2025-08-29 · arXiv: 2508.21496 · Code: https://github.com/hlsv02/ELV-Halluc
AI-generated summary
A benchmark for long-video hallucination identifies and investigates Semantic Aggregation Hallucination (SAH), showing its prevalence in complex and rapidly changing semantic contexts, and proposes strategies to mitigate it.
Video multimodal large language models (Video-MLLMs) have achieved remarkable
progress in video understanding. However, they remain vulnerable to
hallucination: producing content inconsistent with or unrelated to the video input.
Previous video hallucination benchmarks primarily focus on short videos. They
attribute hallucinations to factors such as strong language priors, missing
frames, or vision-language biases introduced by the visual encoder. While these
causes indeed account for most hallucinations in short videos, they
oversimplify the causes of hallucination. Sometimes, models generate incorrect
outputs even when the frame-level semantics are correct. We refer to this type of
hallucination as Semantic Aggregation Hallucination (SAH), which arises during
the process of aggregating frame-level semantics into event-level semantic
groups. Given that SAH becomes particularly critical in long videos due to
increased semantic complexity across multiple events, it is essential to
separate and thoroughly investigate the causes of this type of hallucination.
To address the above issues, we introduce ELV-Halluc, the first benchmark
dedicated to long-video hallucination, enabling a systematic investigation of
SAH. Our experiments confirm the existence of SAH and show that it increases
with semantic complexity. Additionally, we find that models are more prone to
SAH on rapidly changing semantics. Moreover, we discuss potential approaches to
mitigate SAH. We demonstrate that the positional encoding strategy contributes to
alleviating SAH, and further adopt a DPO (Direct Preference Optimization) strategy to enhance the model's ability
to distinguish semantics within and across events. To support this, we curate a
dataset of 8K adversarial data pairs and achieve improvements on both
ELV-Halluc and Video-MME, including a substantial 27.7% reduction in SAH ratio.
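The DPO step above trains on adversarial pairs in which the chosen and rejected responses differ in how semantics are attributed within and across events. The paper's exact training setup is not given in this abstract; the sketch below shows only the standard DPO objective (Rafailov et al., 2023) on a single preference pair, with hypothetical scalar sequence log-likelihoods as inputs.

```python
import math

def dpo_loss(pi_logp_chosen, pi_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss on one preference pair.

    Inputs are sequence log-likelihoods of the chosen (correctly
    aggregated) and rejected (hallucinated) responses under the
    trainable policy (pi_*) and a frozen reference model (ref_*).
    """
    chosen_reward = beta * (pi_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (pi_logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written via log1p for numerical stability
    return math.log1p(math.exp(-margin))
```

When the policy assigns no extra likelihood to either response relative to the reference (margin 0), the loss is log 2; it decreases as the policy prefers the chosen response more strongly, which is what pushes the model to separate within-event from cross-event semantics.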