Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
Ziyu Guo, Xinyan Chen, Renrui Zhang, Ruichuan An, Yu Qi, Dongzhi Jiang, Xiangtai Li, Manyuan Zhang, Hongsheng Li, Pheng-Ann Heng
Published on Oct 30, 2025
Project page: https://video-cof.github.io
Code: https://github.com/ZiyuGuo99/MME-CoF
Abstract
Video models like Veo-3 show promise in short-term visual reasoning but struggle with long-term causal reasoning and abstract logic, indicating they are not yet reliable standalone zero-shot reasoners.
Recent video generation models can produce high-fidelity, temporally coherent videos, indicating that they may encode substantial world knowledge. Beyond realistic synthesis, they also exhibit emerging behaviors indicative of visual perception, modeling, and manipulation. Yet, an important question still remains: Are video models ready to serve as zero-shot reasoners in challenging visual reasoning scenarios? In this work, we conduct an empirical study to comprehensively investigate this question, focusing on the leading and popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, systematically characterizing both its strengths and failure modes. To standardize this study, we curate the evaluation data into MME-CoF, a compact benchmark that enables in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our findings reveal that while current video models demonstrate promising reasoning patterns on short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, they remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but exhibit encouraging signs as complementary visual engines alongside dedicated reasoning models. Project page: https://video-cof.github.io
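For readers building a similar study, the evaluation described above boils down to scoring generated videos along each reasoning dimension and aggregating per dimension. The sketch below is a minimal, hypothetical aggregation helper: the dimension names, score scale, and sample judgments are illustrative assumptions, not the paper's actual rubric or data.

```python
from collections import defaultdict

# Hypothetical per-video judgments: (reasoning dimension, score in [0, 1]).
# Dimension names are illustrative; MME-CoF's 12 dimensions include
# spatial, geometric, physical, temporal, and embodied logic.
judgments = [
    ("spatial", 0.8), ("spatial", 0.6),
    ("geometric", 0.3),
    ("temporal", 0.7), ("temporal", 0.5),
]

def aggregate(judgments):
    """Mean score per reasoning dimension, plus an overall macro-average."""
    buckets = defaultdict(list)
    for dim, score in judgments:
        buckets[dim].append(score)
    per_dim = {dim: sum(s) / len(s) for dim, s in buckets.items()}
    overall = sum(per_dim.values()) / len(per_dim)
    return per_dim, overall

per_dim, overall = aggregate(judgments)
print(per_dim, overall)
```

Macro-averaging over dimensions (rather than pooling all samples) keeps a heavily populated dimension from dominating the overall score, which matters when strengths (e.g., short-horizon spatial coherence) and weaknesses (e.g., abstract logic) are reported separately.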
Community
Excellent work!