Video-BrowseComp: Benchmarking Agentic Video Research on Open Web
\n","updatedAt":"2026-01-02T14:42:42.665Z","author":{"_id":"65243980050781c16f234f1f","avatarUrl":"/avatars/743a009681d5d554c27e04300db9f267.svg","fullname":"Avi","name":"avahal","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":3,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.5912363529205322},"editors":["avahal"],"editorAvatarUrls":["/avatars/743a009681d5d554c27e04300db9f267.svg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2512.23044","authors":[{"_id":"695354a689916ff627aa4013","user":{"_id":"642d26c4c5f19fe0da07284a","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/xJwh9iTDJcHLxaRT0t8OV.png","isPro":false,"fullname":"Zhengyang Liang","user":"chr1ce","type":"user"},"name":"Zhengyang Liang","status":"claimed_verified","statusLastChangedAt":"2026-01-04T20:31:59.883Z","hidden":false},{"_id":"695354a689916ff627aa4014","user":{"_id":"65c4f99b27736b5b86c2cbda","avatarUrl":"/avatars/8789b231ec16073ea0229c28f1f1dd06.svg","isPro":false,"fullname":"Yan Shu","user":"sy1998","type":"user"},"name":"Yan Shu","status":"claimed_verified","statusLastChangedAt":"2025-12-31T20:56:36.143Z","hidden":false},{"_id":"695354a689916ff627aa4015","name":"Xiangrui Liu","hidden":false},{"_id":"695354a689916ff627aa4016","name":"Minghao Qin","hidden":false},{"_id":"695354a689916ff627aa4017","name":"Kaixin Liang","hidden":false},{"_id":"695354a689916ff627aa4018","name":"Paolo Rota","hidden":false},{"_id":"695354a689916ff627aa4019","name":"Nicu Sebe","hidden":false},{"_id":"695354a689916ff627aa401a","name":"Zheng Liu","hidden":false},{"_id":"695354a689916ff627aa401b","name":"Lizi Liao","hidden":false}],"publishedAt":"2025-12-28T19:08:27.000Z","submittedOnDailyAt":"2025-12-30T01:57:28.016Z","title":"Video-BrowseComp: Benchmarking Agentic Video Research on Open Web","submittedOnDailyBy":{"_id":"6039478ab3ecf716b1a5fd4d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6039478ab3ecf716b1a5fd4d/_Thy4E7taiSYBLKxEKJbT.jpeg","isPro":true,"fullname":"taesiri","user":"taesiri","type":"user"},"summary":"The evolution of autonomous agents is redefining information seeking, transitioning from passive retrieval to proactive, open-ended web research. However, while textual and static multimodal agents have seen rapid progress, a significant modality gap remains in processing the web's most dynamic modality: video. Existing video benchmarks predominantly focus on passive perception, feeding curated clips to models without requiring external retrieval. They fail to evaluate agentic video research, which necessitates actively interrogating video timelines, cross-referencing dispersed evidence, and verifying claims against the open web. To bridge this gap, we present Video-BrowseComp, a challenging benchmark comprising 210 questions tailored for open-web agentic video reasoning. Unlike prior benchmarks, Video-BrowseComp enforces a mandatory dependency on temporal visual evidence, ensuring that answers cannot be derived solely through text search but require navigating video timelines to verify external claims. Our evaluation of state-of-the-art models reveals a critical bottleneck: even advanced search-augmented models like GPT-5.1 (w/ Search) achieve only 15.24\\% accuracy. 
Our analysis reveals that these models largely rely on textual proxies, excelling in metadata-rich domains (e.g., TV shows with plot summaries) but collapsing in metadata-sparse, dynamic environments (e.g., sports, gameplay) where visual grounding is essential. As the first open-web video research benchmark, Video-BrowseComp advances the field beyond passive perception toward proactive video reasoning.","upvotes":10,"discussionId":"695354a789916ff627aa401c","projectPage":"https://liang-zhengyang.github.io/video-browsecomp/","githubRepo":"https://github.com/chrisx599/Video-Browser","githubRepoAddedBy":"user","ai_summary":"The paper addresses the modality gap in autonomous agents for video processing by introducing a benchmark requiring proactive, open-web video reasoning, revealing limitations of current models in metadata-sparse, dynamic video domains.","ai_keywords":["autonomous agents","temporal visual evidence","Video-BrowseComp","search-augmented models","metadata-rich domains","metadata-sparse environments","visual grounding","proactive video reasoning","passive perception","GPT-5.1 (w/ Search)"],"githubStars":21},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6039478ab3ecf716b1a5fd4d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6039478ab3ecf716b1a5fd4d/_Thy4E7taiSYBLKxEKJbT.jpeg","isPro":true,"fullname":"taesiri","user":"taesiri","type":"user"},{"_id":"6614ffe1d59f3657d7ee04eb","avatarUrl":"/avatars/51285bbf93f2d7ec235a87250e4a2ffc.svg","isPro":false,"fullname":"minghao qin","user":"CharmingDog","type":"user"},{"_id":"64fde4e252e82dd432b74ce9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64fde4e252e82dd432b74ce9/-CQZbBP7FsPPyawYrsi4z.jpeg","isPro":false,"fullname":"Ling Yang","user":"Lingaaaaaaa","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"66e81679d683a3f4e5274803","avatarUrl":"/avatars/40cfb75aa26d54ab5890bc2f19b3d84b.svg","isPro":false,"fullname":"CoryX","user":"CoryX","type":"user"},{"_id":"6564a2ceedae9c33b7654a1f","avatarUrl":"/avatars/42f09356a1282896573ccb44830cd327.svg","isPro":false,"fullname":"JUNJIE ZHOU","user":"JUNJIE99","type":"user"},{"_id":"65c4f99b27736b5b86c2cbda","avatarUrl":"/avatars/8789b231ec16073ea0229c28f1f1dd06.svg","isPro":false,"fullname":"Yan Shu","user":"sy1998","type":"user"},{"_id":"660d68c54ec3738f8a0a1f42","avatarUrl":"/avatars/01dafd56967c30ff617e39d5e1bbafa1.svg","isPro":false,"fullname":"Zhou Long","user":"ZhouLong","type":"user"},{"_id":"642d26c4c5f19fe0da07284a","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/xJwh9iTDJcHLxaRT0t8OV.png","isPro":false,"fullname":"Zhengyang Liang","user":"chr1ce","type":"user"},{"_id":"686db5d4af2b856fabbf13aa","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/6BjMv2LVNoqvbX8fQSTPI.png","isPro":false,"fullname":"V bbbb","user":"Bbbbbnnn","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
AI-generated summary
The paper addresses the modality gap in autonomous agents for video processing by introducing a benchmark requiring proactive, open-web video reasoning, revealing limitations of current models in metadata-sparse, dynamic video domains.
Abstract
The evolution of autonomous agents is redefining information seeking, transitioning from passive retrieval to proactive, open-ended web research. However, while textual and static multimodal agents have seen rapid progress, a significant modality gap remains in processing the web's most dynamic modality: video. Existing video benchmarks predominantly focus on passive perception, feeding curated clips to models without requiring external retrieval. They fail to evaluate agentic video research, which necessitates actively interrogating video timelines, cross-referencing dispersed evidence, and verifying claims against the open web. To bridge this gap, we present Video-BrowseComp, a challenging benchmark comprising 210 questions tailored for open-web agentic video reasoning. Unlike prior benchmarks, Video-BrowseComp enforces a mandatory dependency on temporal visual evidence, ensuring that answers cannot be derived solely through text search but require navigating video timelines to verify external claims. Our evaluation of state-of-the-art models reveals a critical bottleneck: even advanced search-augmented models like GPT-5.1 (w/ Search) achieve only 15.24% accuracy. Our analysis reveals that these models largely rely on textual proxies, excelling in metadata-rich domains (e.g., TV shows with plot summaries) but collapsing in metadata-sparse, dynamic environments (e.g., sports, gameplay) where visual grounding is essential. As the first open-web video research benchmark, Video-BrowseComp advances the field beyond passive perception toward proactive video reasoning.
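To make the headline number concrete, below is a minimal sketch of how accuracy on a 210-question benchmark like this could be computed. The file names, JSON field names, and exact-match rule are assumptions for illustration only; the paper and its repository define the actual data format and judging protocol.

```python
import json

def exact_match(prediction: str, reference: str) -> bool:
    """Naive normalised string match; real benchmarks often use an LLM judge instead."""
    return prediction.strip().lower() == reference.strip().lower()

def evaluate(predictions_path: str, benchmark_path: str) -> float:
    # Assumed layout: benchmark file is a list of {"id", "question", "answer"},
    # predictions file maps question id -> the agent's final answer string.
    with open(benchmark_path) as f:
        questions = json.load(f)
    with open(predictions_path) as f:
        predictions = json.load(f)

    correct = sum(
        exact_match(predictions.get(q["id"], ""), q["answer"])
        for q in questions
    )
    return correct / len(questions)

# Example of the arithmetic behind the reported score:
# 32 correct answers out of 210 questions -> 32 / 210 ~= 0.1524, i.e. 15.24% accuracy.
```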
Introduces Video-BrowseComp, a benchmark of 210 open-web agentic video questions requiring temporal visual evidence, testing proactive video reasoning grounded in open-web retrieval.