Paper page - Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

https://thinking-with-video.github.io/

\n","updatedAt":"2025-11-09T07:02:49.767Z","author":{"_id":"6690e13ccbcaf7ab0ec1c971","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/e8KDV6J29tviXlIpLZPq6.png","fullname":"Tony.Li","name":"lkdhy","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":4,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6610910296440125},"editors":["lkdhy"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/e8KDV6J29tviXlIpLZPq6.png"],"reactions":[],"isReport":false,"parentCommentId":"690d5df1017ab906dc95056f"}},{"id":"6916484a74b567e045893e11","author":{"_id":"65d9fc2a0e6ad24551d87a1e","avatarUrl":"/avatars/3aedb9522cc3cd08349d654f523fd792.svg","fullname":"Grant Singleton","name":"grantsing","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":4,"isUserFollowing":false},"createdAt":"2025-11-13T21:06:18.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"arXiv explained breakdown of this paper 👉 https://arxivexplained.com/papers/thinking-with-video-video-generation-as-a-promising-multimodal-reasoning-paradigm","html":"

arXiv explained breakdown of this paper 👉 https://arxivexplained.com/papers/thinking-with-video-video-generation-as-a-promising-multimodal-reasoning-paradigm

\n","updatedAt":"2025-11-13T21:06:18.895Z","author":{"_id":"65d9fc2a0e6ad24551d87a1e","avatarUrl":"/avatars/3aedb9522cc3cd08349d654f523fd792.svg","fullname":"Grant Singleton","name":"grantsing","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":4,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6725009679794312},"editors":["grantsing"],"editorAvatarUrls":["/avatars/3aedb9522cc3cd08349d654f523fd792.svg"],"reactions":[],"isReport":false,"parentCommentId":"690d5df1017ab906dc95056f"}}]},{"id":"690e9e0aed5ac2e2e4c2608b","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2025-11-08T01:34:02.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought](https://huggingface.co/papers/2511.02779) (2025)\n* [CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images](https://huggingface.co/papers/2510.11718) (2025)\n* [ImageNet-Think-250K: A Large-Scale Synthetic Dataset for Multimodal Reasoning for Vision Language Models](https://huggingface.co/papers/2510.01582) (2025)\n* [ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning](https://huggingface.co/papers/2510.27492) (2025)\n* [GIR-Bench: Versatile Benchmark for Generating Images with Reasoning](https://huggingface.co/papers/2510.11026) (2025)\n* [Video-Thinker: Sparking\"Thinking with Videos\"via Reinforcement Learning](https://huggingface.co/papers/2510.23473) (2025)\n* [TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning](https://huggingface.co/papers/2511.01833) (2025)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

\n

The following papers were recommended by the Semantic Scholar API

\n\n

Please give a thumbs up to this comment if you found it helpful!

\n

If you want recommendations for any Paper on Hugging Face checkout this Space

\n

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend

\n","updatedAt":"2025-11-08T01:34:02.008Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7178504467010498},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}},{"id":"694ad79a61707ccb93645f44","author":{"_id":"65243980050781c16f234f1f","avatarUrl":"/avatars/743a009681d5d554c27e04300db9f267.svg","fullname":"Avi","name":"avahal","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":3,"isUserFollowing":false},"createdAt":"2025-12-23T17:55:38.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"arXiv lens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/thinking-with-video-video-generation-as-a-promising-multimodal-reasoning-paradigm-2944-57971210\n- Executive Summary\n- Detailed Breakdown\n- Practical Applications\n","html":"

arXiv lens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/thinking-with-video-video-generation-as-a-promising-multimodal-reasoning-paradigm-2944-57971210

\n
    \n
  • Executive Summary
  • \n
  • Detailed Breakdown
  • \n
  • Practical Applications
  • \n
\n","updatedAt":"2025-12-23T17:55:38.117Z","author":{"_id":"65243980050781c16f234f1f","avatarUrl":"/avatars/743a009681d5d554c27e04300db9f267.svg","fullname":"Avi","name":"avahal","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":3,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6811773180961609},"editors":["avahal"],"editorAvatarUrls":["/avatars/743a009681d5d554c27e04300db9f267.svg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2511.04570","authors":[{"_id":"690d5b51ad2597bf6c464ca9","name":"Jingqi Tong","hidden":false},{"_id":"690d5b51ad2597bf6c464caa","name":"Yurong Mou","hidden":false},{"_id":"690d5b51ad2597bf6c464cab","name":"Hangcheng Li","hidden":false},{"_id":"690d5b51ad2597bf6c464cac","user":{"_id":"65e4134472e748aae53e24f3","avatarUrl":"/avatars/3346f4f4cdbffc4c51276be01a6c5f10.svg","isPro":false,"fullname":"Mingzhe Li","user":"Mubuky","type":"user"},"name":"Mingzhe Li","status":"claimed_verified","statusLastChangedAt":"2025-11-07T10:28:47.508Z","hidden":false},{"_id":"690d5b51ad2597bf6c464cad","user":{"_id":"65ab2dd614d782df061265cd","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65ab2dd614d782df061265cd/7T8kMx0wFNTa5zsVQrnOr.jpeg","isPro":false,"fullname":"Yongzhuo Yang","user":"YangYongzhuo","type":"user"},"name":"Yongzhuo Yang","status":"claimed_verified","statusLastChangedAt":"2025-11-10T09:31:39.261Z","hidden":false},{"_id":"690d5b51ad2597bf6c464cae","user":{"_id":"65b71c0582d38451342f7334","avatarUrl":"/avatars/f9763a0ac361c350e6c6732e23564567.svg","isPro":false,"fullname":"Ming Zhang","user":"konglongge","type":"user"},"name":"Ming Zhang","status":"claimed_verified","statusLastChangedAt":"2026-01-07T13:14:06.398Z","hidden":false},{"_id":"690d5b51ad2597bf6c464caf","user":{"_id":"636f526a6cd69d9a36ff2b53","avatarUrl":"/avatars/8f2271a193fcac609d9be270552b5afa.svg","isPro":false,"fullname":"Qiguang Chen","user":"LightChen2333","type":"user"},"name":"Qiguang Chen","status":"claimed_verified","statusLastChangedAt":"2026-01-12T10:37:43.958Z","hidden":false},{"_id":"690d5b51ad2597bf6c464cb0","name":"Tianyi Liang","hidden":false},{"_id":"690d5b51ad2597bf6c464cb1","name":"Xiaomeng Hu","hidden":false},{"_id":"690d5b51ad2597bf6c464cb2","name":"Yining Zheng","hidden":false},{"_id":"690d5b51ad2597bf6c464cb3","name":"Xinchi Chen","hidden":false},{"_id":"690d5b51ad2597bf6c464cb4","name":"Jun Zhao","hidden":false},{"_id":"690d5b51ad2597bf6c464cb5","name":"Xuanjing Huang","hidden":false},{"_id":"690d5b51ad2597bf6c464cb6","name":"Xipeng Qiu","hidden":false}],"publishedAt":"2025-11-06T17:25:23.000Z","submittedOnDailyAt":"2025-11-07T00:18:17.549Z","title":"Thinking with Video: Video Generation as a Promising Multimodal\n Reasoning Paradigm","submittedOnDailyBy":{"_id":"6690e13ccbcaf7ab0ec1c971","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/e8KDV6J29tviXlIpLZPq6.png","isPro":false,"fullname":"Tony.Li","user":"lkdhy","type":"user"},"summary":"\"Thinking with Text\" and \"Thinking with Images\" paradigm significantly\nimprove the reasoning ability of large language models (LLMs) and Vision\nLanguage Models (VLMs). However, these paradigms have inherent limitations. (1)\nImages capture only single moments and fail to represent dynamic processes or\ncontinuous changes, and (2) The separation of text and vision as distinct\nmodalities, hindering unified multimodal understanding and generation. 
To\novercome these limitations, we introduce \"Thinking with Video\", a new paradigm\nthat leverages video generation models, such as Sora-2, to bridge visual and\ntextual reasoning in a unified temporal framework. To support this exploration,\nwe developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench\nencompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing\nPuzzles), and (2) text-centric tasks (e.g., subsets of GSM8K, MMMU). Our\nevaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks,\nSora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even\nsurpasses VLMs on several tasks, such as Eyeballing Games. On text-centric\ntasks, Sora-2 achieves 92% accuracy on MATH, and 75.53% accuracy on MMMU.\nFurthermore, we systematically analyse the source of these abilities. We also\nfind that self-consistency and in-context learning can improve Sora-2's\nperformance. In summary, our findings demonstrate that the video generation\nmodel is the potential unified multimodal understanding and generation model,\npositions \"thinking with video\" as a unified multimodal reasoning paradigm.","upvotes":239,"discussionId":"690d5b51ad2597bf6c464cb7","projectPage":"https://thinking-with-video.github.io/","githubRepo":"https://github.com/tongjingqi/Thinking-with-Video","githubRepoAddedBy":"user","ai_summary":"The \"Thinking with Video\" paradigm enhances multimodal reasoning by integrating video generation models, demonstrated through the Video Thinking Benchmark and improved performance on both vision and text tasks.","ai_keywords":["Thinking with Text","Thinking with Images","large language models","Vision Language Models","Thinking with Video","video generation models","Video Thinking Benchmark","vision-centric tasks","text-centric tasks","Eyeballing Puzzles","GSM8K","MMMU","self-consistency","in-context learning"],"githubStars":257,"organization":{"_id":"613b0dee83ec35d460684607","name":"OpenMOSS-Team","fullname":"OpenMOSS","avatar":"https://cdn-uploads.huggingface.co/production/uploads/61457b8deff2c9fdb4de4988/N5b9663zQ4uq5_OTNlnmw.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6690e13ccbcaf7ab0ec1c971","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/e8KDV6J29tviXlIpLZPq6.png","isPro":false,"fullname":"Tony.Li","user":"lkdhy","type":"user"},{"_id":"636f526a6cd69d9a36ff2b53","avatarUrl":"/avatars/8f2271a193fcac609d9be270552b5afa.svg","isPro":false,"fullname":"Qiguang Chen","user":"LightChen2333","type":"user"},{"_id":"64a7e7b2ef22f9c793e01454","avatarUrl":"/avatars/cd3706ffedbf68f58a8e53046008b7fb.svg","isPro":false,"fullname":"tongjingqi(SII)","user":"tongjingqi","type":"user"},{"_id":"611a2731f2560d2024b4afd2","avatarUrl":"/avatars/294fa218d43cd206fbc2a6c49ef38820.svg","isPro":false,"fullname":"huxiaomeng","user":"gregH","type":"user"},{"_id":"65e4134472e748aae53e24f3","avatarUrl":"/avatars/3346f4f4cdbffc4c51276be01a6c5f10.svg","isPro":false,"fullname":"Mingzhe Li","user":"Mubuky","type":"user"},{"_id":"64cb54da1af278541d663708","avatarUrl":"/avatars/c44507cc92bb2e83154bad31b90ce6dd.svg","isPro":false,"fullname":"Xiaoye Qu","user":"Xiaoye08","type":"user"},{"_id":"683029a472a7ad4f6382d490","avatarUrl":"/avatars/f7fae0899f2dfc468e2848b5569e23b3.svg","isPro":false,"fullname":"Yurong 
Mou","user":"1v1-j12","type":"user"},{"_id":"64f033ef82c6eea604c4da8b","avatarUrl":"/avatars/51b93fea7fd68b4274ee03701245dcca.svg","isPro":false,"fullname":"Xiaoran Liu (SII)","user":"SII-xrliu","type":"user"},{"_id":"65ab2dd614d782df061265cd","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65ab2dd614d782df061265cd/7T8kMx0wFNTa5zsVQrnOr.jpeg","isPro":false,"fullname":"Yongzhuo Yang","user":"YangYongzhuo","type":"user"},{"_id":"653dd16277c2f09452ad37cd","avatarUrl":"/avatars/a95f9527722845a5414d86180c8e945d.svg","isPro":false,"fullname":"Yunzhuo Hao","user":"luckychao","type":"user"},{"_id":"6458af46f4d212d780bd7c68","avatarUrl":"/avatars/832fd34bcc041b0b7b551873a459fc3c.svg","isPro":false,"fullname":"Wei Liu","user":"PeterV09","type":"user"},{"_id":"63f3502a520c14618925825a","avatarUrl":"/avatars/e986a2a6625e7be6890616a417f908d2.svg","isPro":false,"fullname":"Yafu Li","user":"yaful","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":1,"organization":{"_id":"613b0dee83ec35d460684607","name":"OpenMOSS-Team","fullname":"OpenMOSS","avatar":"https://cdn-uploads.huggingface.co/production/uploads/61457b8deff2c9fdb4de4988/N5b9663zQ4uq5_OTNlnmw.png"}}">
Papers
arxiv:2511.04570

Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

Published on Nov 6, 2025
Submitted by Tony.Li on Nov 7, 2025
#1 Paper of the day
Authors: Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang, Ming Zhang, Qiguang Chen, Tianyi Liang, Xiaomeng Hu, Yining Zheng, Xinchi Chen, Jun Zhao, Xuanjing Huang, Xipeng Qiu

Abstract

The "Thinking with Text" and "Thinking with Images" paradigms significantly improve the reasoning ability of large language models (LLMs) and Vision Language Models (VLMs). However, these paradigms have inherent limitations: (1) images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) the separation of text and vision into distinct modalities hinders unified multimodal understanding and generation. To overcome these limitations, we introduce "Thinking with Video", a new paradigm that leverages video generation models, such as Sora-2, to bridge visual and textual reasoning in a unified temporal framework. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K and MMMU). Our evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even surpasses them on several tasks, such as Eyeballing Games. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH and 75.53% accuracy on MMMU. Furthermore, we systematically analyse the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2's performance. In summary, our findings demonstrate that video generation models are a potential route to unified multimodal understanding and generation, positioning "Thinking with Video" as a unified multimodal reasoning paradigm.

AI-generated summary

The "Thinking with Video" paradigm enhances multimodal reasoning by integrating video generation models, demonstrated through the Video Thinking Benchmark and improved performance on both vision and text tasks.

Community

Paper submitter · edited Nov 7, 2025

We introduce "Thinking with Video", a new paradigm that leverages video generation models, such as Sora-2, to bridge visual and textual reasoning in a unified temporal framework. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench). Our findings demonstrate that video generation models are a potential route to unified multimodal understanding and generation, positioning "Thinking with Video" as a unified multimodal reasoning paradigm.

arXiv explained breakdown of this paper 👉 https://arxivexplained.com/papers/thinking-with-video-video-generation-as-a-promising-multimodal-reasoning-paradigm

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

  • When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought (2025), https://huggingface.co/papers/2511.02779
  • CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images (2025), https://huggingface.co/papers/2510.11718
  • ImageNet-Think-250K: A Large-Scale Synthetic Dataset for Multimodal Reasoning for Vision Language Models (2025), https://huggingface.co/papers/2510.01582
  • ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning (2025), https://huggingface.co/papers/2510.27492
  • GIR-Bench: Versatile Benchmark for Generating Images with Reasoning (2025), https://huggingface.co/papers/2510.11026
  • Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning (2025), https://huggingface.co/papers/2510.23473
  • TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning (2025), https://huggingface.co/papers/2511.01833

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

arXiv lens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/thinking-with-video-video-generation-as-a-promising-multimodal-reasoning-paradigm-2944-57971210

  • Executive Summary
  • Detailed Breakdown
  • Practical Applications


Models citing this paper 0

No model linking this paper


Datasets citing this paper 1

Spaces citing this paper 0

No Space linking this paper


Collections including this paper 18