Paper page - Rethinking Chain-of-Thought Reasoning for Videos
Authors: Yiwu Zhong, Zi-Yuan Hu, Yin Li, Liwei Wang
Published: 2025-12-10 (arXiv: 2512.09616)
Code: https://github.com/LaVi-Lab/Rethink_CoT_Video
AI-generated summary
Efficient video reasoning can be achieved using concise chains of thought and reduced visual tokens without manual annotations or supervised fine-tuning.

Abstract
Chain-of-thought (CoT) reasoning has been highly successful in solving complex tasks in natural language processing, and recent multimodal large language models (MLLMs) have extended this paradigm to video reasoning. However, these models typically build on lengthy reasoning chains and large numbers of input visual tokens. Motivated by empirical observations from our benchmark study, we hypothesize that concise reasoning combined with a reduced set of visual tokens can be sufficient for effective video reasoning. To evaluate this hypothesis, we design and validate an efficient post-training and inference framework that enhances a video MLLM's reasoning capability. Our framework enables models to operate on compressed visual tokens and generate brief reasoning traces prior to answering. The resulting models achieve substantially improved inference efficiency, deliver competitive performance across diverse benchmarks, and avoid reliance on manual CoT annotations or supervised fine-tuning. Collectively, our results suggest that long, human-like CoT reasoning may not be necessary for general video reasoning, and that concise reasoning can be both effective and efficient. Our code will be released at https://github.com/LaVi-Lab/Rethink_CoT_Video.
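To make the recipe concrete, here is a minimal sketch of the two ideas the abstract describes: operating on compressed visual tokens and eliciting a brief reasoning trace before the answer. All names below (`compress_visual_tokens`, `BRIEF_COT_PROMPT`, `model.generate`) are hypothetical stand-ins, not the released implementation, and average pooling is just one plausible compression operator; see the repository for the actual pipeline.

```python
import torch

def compress_visual_tokens(frame_tokens: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """Average-pool contiguous groups of visual tokens down to roughly
    keep_ratio of the original count (one plausible compressor; the
    paper's actual operator may differ)."""
    n, d = frame_tokens.shape
    k = max(1, int(n * keep_ratio))
    group = n // k
    # (k * group, d) -> (k, group, d) -> mean over each group -> (k, d)
    return frame_tokens[: k * group].reshape(k, group, d).mean(dim=1)

# A prompt asking for a short trace before the answer, mirroring the
# "brief reasoning traces prior to answering" behavior the abstract describes.
BRIEF_COT_PROMPT = "Reason in at most three sentences, then finish with 'Answer: <option>'."

def answer_video_question(model, video_tokens: torch.Tensor, question: str) -> str:
    compressed = compress_visual_tokens(video_tokens)  # fewer input visual tokens
    prompt = f"{BRIEF_COT_PROMPT}\n{question}"
    # `model.generate` stands in for whatever MLLM decoding interface is used;
    # the short token budget reflects the concise-reasoning setting.
    return model.generate(compressed, prompt, max_new_tokens=128)
```

The point of the sketch is that both savings are orthogonal: token compression shrinks the input, while the prompt (rather than manual CoT supervision) keeps the output trace short.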
Comments

Deepak Sridhar (DeepakSridhar), 2025-12-17:
Nice work! Our work tackles a similar problem, achieving efficient video reasoning without any training: https://huggingface.co/papers/2510.17045. We introduce a training-free method that enhances video reasoning via value-cache optimization during inference, narrowing the gap to RL-trained models while using fewer output tokens. We plan to discuss your paper in our work as a concurrent method, and we would greatly appreciate it if you could discuss our paper in a similar vein.
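For readers wondering what training-free, inference-time cache adaptation can look like in general, here is a heavily hedged sketch that nudges cached value vectors by gradient ascent on a user-supplied confidence objective. This is a generic illustration under stated assumptions, not the method of 2510.17045, whose details are not reproduced here; `objective_fn` is an assumed closure over the frozen model.

```python
import torch

def adapt_value_cache(values: torch.Tensor, objective_fn, steps: int = 3, lr: float = 1e-2) -> torch.Tensor:
    """Nudge a cached value tensor to increase objective_fn(values),
    e.g., the log-probability of a candidate answer token.

    Generic test-time adaptation sketch: the model weights stay frozen;
    only the cache entries are updated, so no training is involved.
    """
    v = values.detach().clone().requires_grad_(True)
    opt = torch.optim.SGD([v], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -objective_fn(v)  # minimize the negative = ascend the objective
        loss.backward()
        opt.step()
    return v.detach()
```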