\n","updatedAt":"2025-11-12T21:03:47.315Z","author":{"_id":"65d9fc2a0e6ad24551d87a1e","avatarUrl":"/avatars/3aedb9522cc3cd08349d654f523fd792.svg","fullname":"Grant Singleton","name":"grantsing","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":4,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7937107682228088},"editors":["grantsing"],"editorAvatarUrls":["/avatars/3aedb9522cc3cd08349d654f523fd792.svg"],"reactions":[{"reaction":"🤗","users":["taesiri"],"count":1}],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2511.04460","authors":[{"_id":"690d5b2aad2597bf6c464c9a","user":{"_id":"6683a05e74fb1736a4b7c934","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6683a05e74fb1736a4b7c934/eiz6qlqIUjAWGy5zfg8Cs.jpeg","isPro":false,"fullname":"QRQ","user":"RichardQRQ","type":"user"},"name":"Runqi Qiao","status":"claimed_verified","statusLastChangedAt":"2025-11-10T09:32:03.684Z","hidden":false},{"_id":"690d5b2aad2597bf6c464c9b","name":"Qiuna Tan","hidden":false},{"_id":"690d5b2aad2597bf6c464c9c","name":"Minghan Yang","hidden":false},{"_id":"690d5b2aad2597bf6c464c9d","name":"Guanting Dong","hidden":false},{"_id":"690d5b2aad2597bf6c464c9e","name":"Peiqing Yang","hidden":false},{"_id":"690d5b2aad2597bf6c464c9f","user":{"_id":"65276b6bd4670b0875298815","avatarUrl":"/avatars/65c68f4e1d208705283c04b1ef63a13e.svg","isPro":false,"fullname":"Bruceq","user":"langshiqiang","type":"user"},"name":"Shiqiang Lang","status":"claimed_verified","statusLastChangedAt":"2025-12-01T16:25:03.487Z","hidden":false},{"_id":"690d5b2aad2597bf6c464ca0","name":"Enhui Wan","hidden":false},{"_id":"690d5b2aad2597bf6c464ca1","name":"Xiaowan Wang","hidden":false},{"_id":"690d5b2aad2597bf6c464ca2","name":"Yida Xu","hidden":false},{"_id":"690d5b2aad2597bf6c464ca3","name":"Lan Yang","hidden":false},{"_id":"690d5b2aad2597bf6c464ca4","name":"Chong Sun","hidden":false},{"_id":"690d5b2aad2597bf6c464ca5","name":"Chen Li","hidden":false},{"_id":"690d5b2aad2597bf6c464ca6","name":"Honggang Zhang","hidden":false}],"publishedAt":"2025-11-06T15:32:29.000Z","submittedOnDailyAt":"2025-11-07T00:09:41.943Z","title":"V-Thinker: Interactive Thinking with Images","submittedOnDailyBy":{"_id":"6039478ab3ecf716b1a5fd4d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6039478ab3ecf716b1a5fd4d/_Thy4E7taiSYBLKxEKJbT.jpeg","isPro":true,"fullname":"taesiri","user":"taesiri","type":"user"},"summary":"Empowering Large Multimodal Models (LMMs) to deeply integrate image\ninteraction with long-horizon reasoning capabilities remains a long-standing\nchallenge in this field. Recent advances in vision-centric reasoning explore a\npromising \"Thinking with Images\" paradigm for LMMs, marking a shift from\nimage-assisted reasoning to image-interactive thinking. While this milestone\nenables models to focus on fine-grained image regions, progress remains\nconstrained by limited visual tool spaces and task-specific workflow designs.\nTo bridge this gap, we present V-Thinker, a general-purpose multimodal\nreasoning assistant that enables interactive, vision-centric thinking through\nend-to-end reinforcement learning. 
V-Thinker comprises two key components: (1)\na Data Evolution Flywheel that automatically synthesizes, evolves, and verifies\ninteractive reasoning datasets across three dimensions-diversity, quality, and\ndifficulty; and (2) a Visual Progressive Training Curriculum that first aligns\nperception via point-level supervision, then integrates interactive reasoning\nthrough a two-stage reinforcement learning framework. Furthermore, we introduce\nVTBench, an expert-verified benchmark targeting vision-centric interactive\nreasoning tasks. Extensive experiments demonstrate that V-Thinker consistently\noutperforms strong LMM-based baselines in both general and interactive\nreasoning scenarios, providing valuable insights for advancing\nimage-interactive reasoning applications.","upvotes":97,"discussionId":"690d5b2aad2597bf6c464ca7","githubRepo":"https://github.com/We-Math/V-Thinker","githubRepoAddedBy":"user","ai_summary":"V-Thinker, a multimodal reasoning assistant using reinforcement learning, enhances image-interactive thinking by synthesizing datasets and aligning perception for improved performance in vision-centric tasks.","ai_keywords":["multimodal models","image interaction","long-horizon reasoning","Thinking with Images","image-interactive thinking","end-to-end reinforcement learning","Data Evolution Flywheel","Visual Progressive Training Curriculum","point-level supervision","two-stage reinforcement learning","VTBench","vision-centric interactive reasoning"],"githubStars":168},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6683a05e74fb1736a4b7c934","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6683a05e74fb1736a4b7c934/eiz6qlqIUjAWGy5zfg8Cs.jpeg","isPro":false,"fullname":"QRQ","user":"RichardQRQ","type":"user"},{"_id":"6039478ab3ecf716b1a5fd4d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6039478ab3ecf716b1a5fd4d/_Thy4E7taiSYBLKxEKJbT.jpeg","isPro":true,"fullname":"taesiri","user":"taesiri","type":"user"},{"_id":"6684274046da9ec5d66f6570","avatarUrl":"/avatars/bd83cc8ac53a8abca7c9edf5b4700074.svg","isPro":false,"fullname":"forger","user":"aniya-forger","type":"user"},{"_id":"68721f531825dc4d3e52136d","avatarUrl":"/avatars/ae4c2ac7b400b6e6b52dc3a257c01e30.svg","isPro":false,"fullname":"Enhui Wan","user":"Enghui","type":"user"},{"_id":"662b19df4f711ee4e1f5bf9b","avatarUrl":"/avatars/d2a0eab67f71d0eab28dffa7214186eb.svg","isPro":false,"fullname":"Yida 
Xu","user":"yidada","type":"user"},{"_id":"65276b6bd4670b0875298815","avatarUrl":"/avatars/65c68f4e1d208705283c04b1ef63a13e.svg","isPro":false,"fullname":"Bruceq","user":"langshiqiang","type":"user"},{"_id":"61cd4b833dd34ba1985e0753","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/61cd4b833dd34ba1985e0753/BfHfrwotoMESpXZOHiIe4.png","isPro":false,"fullname":"KABI","user":"dongguanting","type":"user"},{"_id":"6730269462930cbc4611dda7","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/wdpk96XoIrPyNO6OCASVE.png","isPro":false,"fullname":"MingHan","user":"YanmHa","type":"user"},{"_id":"63eb2e7f91a1b8ec4fbd449b","avatarUrl":"/avatars/f27c19099ea8cb4b69c5008e7640dec4.svg","isPro":false,"fullname":"Richard_Joe","user":"Steven-BUPT","type":"user"},{"_id":"668379997e757a10563fc786","avatarUrl":"/avatars/944f199567d711382c22aa51f964398a.svg","isPro":false,"fullname":"one","user":"Fineone","type":"user"},{"_id":"66842b38c1be1cd1690efb94","avatarUrl":"/avatars/38df7cd7a3516f77dfd3e5d8cf83d9b3.svg","isPro":false,"fullname":"nanatata","user":"nanatata","type":"user"},{"_id":"6683110f549c1b932c7a7710","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6683110f549c1b932c7a7710/tFKh-_4MOVFhTHhBgcgOZ.png","isPro":false,"fullname":"We-Math","user":"We-Math","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":2}">Abstract
V-Thinker, a multimodal reasoning assistant using reinforcement learning, enhances image-interactive thinking by synthesizing datasets and aligning perception for improved performance in vision-centric tasks.
Empowering Large Multimodal Models (LMMs) to deeply integrate image interaction with long-horizon reasoning capabilities remains a long-standing challenge in this field. Recent advances in vision-centric reasoning explore a promising "Thinking with Images" paradigm for LMMs, marking a shift from image-assisted reasoning to image-interactive thinking. While this milestone enables models to focus on fine-grained image regions, progress remains constrained by limited visual tool spaces and task-specific workflow designs. To bridge this gap, we present V-Thinker, a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning. V-Thinker comprises two key components: (1) a Data Evolution Flywheel that automatically synthesizes, evolves, and verifies interactive reasoning datasets across three dimensions: diversity, quality, and difficulty; and (2) a Visual Progressive Training Curriculum that first aligns perception via point-level supervision, then integrates interactive reasoning through a two-stage reinforcement learning framework. Furthermore, we introduce VTBench, an expert-verified benchmark targeting vision-centric interactive reasoning tasks. Extensive experiments demonstrate that V-Thinker consistently outperforms strong LMM-based baselines in both general and interactive reasoning scenarios, providing valuable insights for advancing image-interactive reasoning applications.
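To make the "Thinking with Images" loop concrete, here is a minimal, self-contained sketch (Python with Pillow) of the kind of interaction the abstract describes: a policy proposes a visual tool call, the edited image is fed back, and the loop ends when the policy decides to answer. This is an illustration only, not V-Thinker's implementation; `propose_action` is a hypothetical stub standing in for the trained LMM, and the two tools shown (crop, annotate) are assumptions about what such a tool space might contain.

```python
# Illustrative sketch only -- not V-Thinker's implementation. It shows the shape of an
# image-interactive reasoning loop: a policy proposes visual tool calls (crop, annotate),
# the environment applies them, and the edited image is fed back for the next step.
# `propose_action` is a hypothetical stub standing in for the trained LMM policy.

from PIL import Image, ImageDraw


def apply_tool(image: Image.Image, action: dict) -> Image.Image:
    """Execute one visual tool call and return the edited image."""
    if action["tool"] == "crop":
        return image.crop(action["box"])                    # zoom into a region
    if action["tool"] == "annotate":
        edited = image.copy()
        ImageDraw.Draw(edited).rectangle(action["box"], outline="red", width=3)
        return edited                                       # highlight a region
    return image                                            # unknown tool: no-op


def propose_action(image: Image.Image, question: str, step: int):
    """Hypothetical policy stub; a real system would query the LMM here."""
    if step == 0:
        return {"tool": "annotate", "box": (10, 10, 60, 60)}
    if step == 1:
        return {"tool": "crop", "box": (0, 0, 64, 64)}
    return None                                             # policy decides to answer


def interactive_reasoning(image: Image.Image, question: str, max_steps: int = 4) -> Image.Image:
    """Alternate visual tool use with (stubbed) reasoning until the policy answers."""
    for step in range(max_steps):
        action = propose_action(image, question, step)
        if action is None:
            break
        image = apply_tool(image, action)
    return image                                            # final evidence image


if __name__ == "__main__":
    canvas = Image.new("RGB", (128, 128), "white")
    final = interactive_reasoning(canvas, "What is in the top-left region?")
    print(final.size)                                       # (64, 64) after the crop
```

In V-Thinker this kind of interaction is learned end-to-end with reinforcement learning rather than hand-scripted, which is what the two-stage curriculum in the abstract refers to.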
Community
V-Thinker is a general-purpose multimodal reasoning assistant that enables Interactive Thinking with Images through end-to-end reinforcement learning. Unlike traditional vision-language models, V-Thinker actively interacts with visual content—editing, annotating, and transforming images to simplify complex problems.
💻 Github: https://github.com/We-Math/V-Thinker
🤗 Dataset: https://huggingface.co/datasets/We-Math/V-Interaction-400K
💡 Overview
V-Thinker is a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning.
Unlike traditional vision-language models that passively process visual inputs, V-Thinker actively interacts with images — editing, annotating, and transforming them to simplify complex problems and achieve reasoning grounded in perception and logic.
To address the limited diversity and scalability of existing visual reasoning datasets, we rethink the traditional data synthesis paradigm by transforming models from “solvers” to “creators.”
This establishes a new vision-centric data synthesis framework that empowers models to autonomously generate high-quality, diverse, and knowledge-grounded multimodal reasoning data.
Built upon this foundation, V-Thinker integrates a unified post-training paradigm that combines data evolution, perception alignment, and interactive reasoning into a coherent pipeline for advancing vision-centric reasoning.
Representative examples of V-Thinker's knowledge-driven synthesis spanning diverse reasoning domains.
Complete interactive reasoning samples of V-Thinker on open-source benchmarks.
Visualization of the evolved knowledge system through the Data Evolution Flywheel.
Qualitative analysis of V-Thinker-7B on vision-centric interactive reasoning tasks.
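The "solvers to creators" data-synthesis idea described in the overview can be illustrated with a toy generate-verify loop: a creator drafts candidate problems from a knowledge point, an independent verifier re-checks each answer, and a simple deduplication step approximates the diversity filter. The sketch below is a hedged, assumption-laden stand-in, not the paper's Data Evolution Flywheel: `generate_candidate` and `verify` are deterministic stubs where a real pipeline would call the model, and the difficulty dimension is omitted.

```python
# Toy sketch of a generate -> verify -> filter loop, in the spirit of a data
# "flywheel". Assumptions: `generate_candidate` and `verify` are simple stubs;
# a real pipeline would replace them with LMM calls and difficulty scoring.

import random


def generate_candidate(knowledge_point: str) -> dict:
    """Stub 'creator' step: turn a knowledge point into a candidate QA item."""
    a, b = random.randint(1, 9), random.randint(1, 9)
    return {
        "knowledge": knowledge_point,
        "question": f"What is {a} + {b}?",
        "answer": a + b,
    }


def verify(item: dict) -> bool:
    """Stub verifier: recompute the answer independently and compare."""
    digits = [int(tok) for tok in item["question"].rstrip("?").split() if tok.isdigit()]
    return sum(digits) == item["answer"]


def evolve_dataset(knowledge_point: str, target_size: int = 5) -> list[dict]:
    """Keep only verified, non-duplicate items (quality and diversity filters)."""
    kept, seen = [], set()
    while len(kept) < target_size:
        item = generate_candidate(knowledge_point)
        if item["question"] in seen:        # diversity: drop exact duplicates
            continue
        if verify(item):                    # quality: independent re-check
            seen.add(item["question"])
            kept.append(item)
    return kept


if __name__ == "__main__":
    for row in evolve_dataset("integer addition"):
        print(row)
```

With real model generations the verifier would actually reject items, and the verified set would feed the next round of synthesis, which is presumably where the "flywheel" framing comes from.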
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- DeepSketcher: Internalizing Visual Manipulation for Multimodal Reasoning (2025)
- Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models (2025)
- CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images (2025)
- ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Model (2025)
- SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models (2025)
- MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning (2025)
- Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
arXiv explained breakdown of this paper 👉 https://arxivexplained.com/papers/v-thinker-interactive-thinking-with-images
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper