Paper page - Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence
Project page: https://marinero4972.github.io/projects/Open-o3-Video/
Code: https://github.com/marinero4972/Open-o3-Video

Authors: Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, Zhuochen Wang (ByteDance)

Published: October 23, 2025 (arXiv:2510.20579)
AI-generated summary
Open-o3 Video integrates spatio-temporal evidence into video reasoning, achieving state-of-the-art performance on multiple benchmarks and providing valuable reasoning traces for test-time scaling.

Most video reasoning models only generate textual reasoning traces without
indicating when and where key evidence appears. Recent models such as OpenAI-o3
have sparked wide interest in evidence-centered reasoning for images, yet
extending this ability to videos is more challenging, as it requires joint
temporal tracking and spatial localization across dynamic scenes. We introduce
Open-o3 Video, a non-agent framework that integrates explicit spatio-temporal
evidence into video reasoning, and carefully collect training data and design
training strategies to address the aforementioned challenges. The model
highlights key timestamps, objects, and bounding boxes alongside its answers,
allowing reasoning to be grounded in concrete visual observations. To enable
this functionality, we first curate and build two high-quality datasets,
STGR-CoT-30k for SFT and STGR-RL-36k for RL, with carefully constructed
temporal and spatial annotations, since most existing datasets offer either
temporal spans for videos or spatial boxes on images, lacking unified
spatio-temporal supervision and reasoning traces. Then, we adopt a cold-start
reinforcement learning strategy with multiple specially designed rewards that
jointly encourage answer accuracy, temporal alignment, and spatial precision.
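
As a rough illustration of how such a composite reward might be computed, the sketch below combines an exact-match answer term with IoU-based temporal and spatial terms. The IoU formulation and the weights (w_ans, w_time, w_space) are assumptions made for illustration, not the paper's exact reward design; in practice the temporal and spatial targets would come from annotations like those in STGR-RL-36k.

```python
def interval_iou(pred, gt):
    """Temporal IoU between two (start, end) spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def box_iou(pred, gt):
    """Spatial IoU between two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area(pred) + area(gt) - inter
    return inter / union if union > 0 else 0.0

def grounded_reward(pred_answer, gt_answer, pred_span, gt_span,
                    pred_box, gt_box, w_ans=1.0, w_time=0.5, w_space=0.5):
    """Weighted sum of answer, temporal, and spatial terms.

    The weights and the IoU-based terms are illustrative assumptions,
    not the reward actually used to train Open-o3 Video.
    """
    r_ans = 1.0 if pred_answer.strip().lower() == gt_answer.strip().lower() else 0.0
    return (w_ans * r_ans
            + w_time * interval_iou(pred_span, gt_span)
            + w_space * box_iou(pred_box, gt_box))
```
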
On the V-STAR benchmark, Open-o3 Video achieves state-of-the-art performance,
raising mAM by 14.4% and mLGM by 24.2% over the Qwen2.5-VL baseline. Consistent
improvements are also observed on a broad range of video understanding
benchmarks, including VideoMME, WorldSense, VideoMMMU, and TVGBench. Beyond
accuracy, the reasoning traces produced by Open-o3 Video also provide valuable
signals for test-time scaling, enabling confidence-aware verification and
improving answer reliability.
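
Since the abstract only states that the reasoning traces enable confidence-aware verification at test time, the following is a generic sketch of one way such traces could be used: sample several traces per question and keep the answer whose traces accumulate the highest confidence. The sampling interface and the confidence scores are assumptions; the paper's actual verification procedure may differ.

```python
from collections import defaultdict

def confidence_weighted_answer(samples):
    """samples: list of (answer, confidence) pairs, one per sampled reasoning
    trace. Returns the answer whose traces carry the highest total confidence.
    This is a generic confidence-weighted vote, not the paper's exact scheme."""
    scores = defaultdict(float)
    canonical = {}
    for answer, confidence in samples:
        key = answer.strip().lower()
        scores[key] += confidence
        canonical.setdefault(key, answer.strip())
    best = max(scores, key=scores.get)
    return canonical[best]

# Example with three hypothetical sampled traces for one question:
print(confidence_weighted_answer([("B", 0.9), ("A", 0.4), ("B", 0.7)]))  # -> "B"
```
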
Open-o3 Video introduces a video reasoning framework that grounds its answers in explicit spatio-temporal evidence, highlighting when and where key visual cues occur, and achieves state-of-the-art performance across video understanding benchmarks.
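
To make the grounded output format concrete: the model is said to emit timestamps, object labels, and bounding boxes alongside its answers. The snippet below parses such evidence from a reasoning trace; the `<obs>...</obs>` tag syntax is a hypothetical placeholder, not the serialization used by the released model.

```python
import re
from dataclasses import dataclass

# Hypothetical serialization of one piece of spatio-temporal evidence.
# The real Open-o3 Video output format may differ; this is only a sketch.
EVIDENCE_PATTERN = re.compile(
    r"<obs>\s*t=(?P<t>[\d.]+)s,\s*object=(?P<obj>[^,]+),\s*"
    r"box=\[(?P<box>[\d.,\s]+)\]\s*</obs>"
)

@dataclass
class Evidence:
    timestamp: float   # seconds into the video
    label: str         # object mentioned in the reasoning trace
    box: tuple         # (x1, y1, x2, y2), normalized coordinates assumed

def parse_evidence(trace: str) -> list:
    """Extract all evidence items from a grounded reasoning trace."""
    items = []
    for m in EVIDENCE_PATTERN.finditer(trace):
        coords = tuple(float(v) for v in m.group("box").split(","))
        items.append(Evidence(float(m.group("t")), m.group("obj").strip(), coords))
    return items

demo = ("<obs>t=12.3s, object=red car, box=[0.10, 0.25, 0.48, 0.70]</obs> "
        "The car enters from the left, so the answer is (B).")
print(parse_evidence(demo))
```
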