Paper page - Long Grounded Thoughts: Distilling Compositional Visual Reasoning Chains at Scale
\n","updatedAt":"2025-11-11T17:22:07.961Z","author":{"_id":"629e1b71bb6419817ed7566c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/629e1b71bb6419817ed7566c/0ZCt-11eQtRDCOk9AozOp.jpeg","fullname":"Huck Yang","name":"huckiyang","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":7,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7181824445724487},"editors":["huckiyang"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/629e1b71bb6419817ed7566c/0ZCt-11eQtRDCOk9AozOp.jpeg"],"reactions":[],"isReport":false}},{"id":"6913e4a71e7ec46276378eb4","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2025-11-12T01:36:39.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [ReWatch-R1: Boosting Complex Video Reasoning in Large Vision-Language Models through Agentic Data Synthesis](https://huggingface.co/papers/2509.23652) (2025)\n* [Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation](https://huggingface.co/papers/2510.20812) (2025)\n* [SightSound-R1: Cross-Modal Reasoning Distillation from Vision to Audio Language Models](https://huggingface.co/papers/2509.15661) (2025)\n* [Composition-Grounded Instruction Synthesis for Visual Reasoning](https://huggingface.co/papers/2510.15040) (2025)\n* [SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models](https://huggingface.co/papers/2510.08531) (2025)\n* [ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning](https://huggingface.co/papers/2510.27492) (2025)\n* [Activating Visual Context and Commonsense Reasoning through Masked Prediction in VLMs](https://huggingface.co/papers/2510.21807) (2025)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
\n
The following papers were recommended by the Semantic Scholar API
Please give a thumbs up to this comment if you found it helpful!
\n
If you want recommendations for any Paper on Hugging Face checkout this Space
\n
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend
\n","updatedAt":"2025-11-12T01:36:39.823Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.721553385257721},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2511.05705","authors":[{"_id":"6913700aa644ba07c499c928","name":"David Acuna","hidden":false},{"_id":"6913700aa644ba07c499c929","name":"Chao-Han Huck Yang","hidden":false},{"_id":"6913700aa644ba07c499c92a","name":"Yuntian Deng","hidden":false},{"_id":"6913700aa644ba07c499c92b","user":{"_id":"61703fa3dff0ef663e421ab5","avatarUrl":"/avatars/96172e2782e218bbbddfdf47f96c1ad4.svg","isPro":false,"fullname":"Jaehun Jung","user":"Jaehun","type":"user"},"name":"Jaehun Jung","status":"claimed_verified","statusLastChangedAt":"2025-11-11T19:41:00.690Z","hidden":false},{"_id":"6913700aa644ba07c499c92c","name":"Ximing Lu","hidden":false},{"_id":"6913700aa644ba07c499c92d","name":"Prithviraj Ammanabrolu","hidden":false},{"_id":"6913700aa644ba07c499c92e","name":"Hyunwoo Kim","hidden":false},{"_id":"6913700aa644ba07c499c92f","name":"Yuan-Hong Liao","hidden":false},{"_id":"6913700aa644ba07c499c930","name":"Yejin Choi","hidden":false}],"mediaUrls":["https://cdn-uploads.huggingface.co/production/uploads/629e1b71bb6419817ed7566c/O-fcFejizuqy5q8tYQhdx.png"],"publishedAt":"2025-11-07T20:50:54.000Z","submittedOnDailyAt":"2025-11-11T14:52:07.954Z","title":"Long Grounded Thoughts: Distilling Compositional Visual Reasoning Chains at Scale","submittedOnDailyBy":{"_id":"629e1b71bb6419817ed7566c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/629e1b71bb6419817ed7566c/0ZCt-11eQtRDCOk9AozOp.jpeg","isPro":false,"fullname":"Huck Yang","user":"huckiyang","type":"user"},"summary":"Recent progress in multimodal reasoning has been driven largely by undisclosed datasets and proprietary data synthesis recipes, leaving open questions about how to systematically build large-scale, vision-centric reasoning datasets, particularly for tasks that go beyond visual math. In this work, we introduce a new reasoning data generation framework spanning diverse skills and levels of complexity with over 1M high-quality synthetic vision-centric questions. The dataset also includes preference data and instruction prompts supporting both offline and online RL. Our synthesis framework proceeds in two stages: (1) scale; and (2) complexity. Reasoning traces are then synthesized through a two-stage process that leverages VLMs and reasoning LLMs, producing CoT traces for VLMs that capture the richness and diverse cognitive behaviors found in frontier reasoning models. Remarkably, we show that finetuning Qwen2.5-VL-7B on our data outperforms all open-data baselines across all evaluated vision-centric benchmarks, and even surpasses strong closed-data models such as MiMo-VL-7B-RL on V* Bench, CV-Bench and MMStar-V. Perhaps most surprising, despite being entirely vision-centric, our data transfers positively to text-only reasoning (MMLU-Pro) and audio reasoning (MMAU), demonstrating its effectiveness. 
Similarly, despite not containing videos or embodied visual data, we observe notable gains when evaluating on a single-evidence embodied QA benchmark (NiEH). Finally, we use our data to analyze the entire VLM post-training pipeline. Our empirical analysis highlights that (i) SFT on high-quality data with non-linear reasoning traces is essential for effective online RL, (ii) staged offline RL matches online RL's performance while reducing compute demands, and (iii) careful SFT on high quality data can substantially improve out-of-domain, cross-modality transfer.","upvotes":8,"discussionId":"6913700aa644ba07c499c931","ai_summary":"A new reasoning data generation framework creates a large-scale vision-centric dataset with over 1M synthetic questions, enhancing performance across various benchmarks and improving cross-modality transfer.","ai_keywords":["multimodal reasoning","reasoning data generation framework","synthetic vision-centric questions","preference data","instruction prompts","offline RL","online RL","VLMs","reasoning LLMs","CoT traces","finetuning","Qwen2.5-VL-7B","MiMo-VL-7B-RL","V* Bench","CV-Bench","MMStar-V","MMLU-Pro","MMAU","NiEH","SFT","staged offline RL","out-of-domain","cross-modality transfer"],"organization":{"_id":"60262b67268c201cdc8b7d43","name":"nvidia","fullname":"NVIDIA","avatar":"https://cdn-uploads.huggingface.co/production/uploads/1613114437487-60262a8e0703121c822a80b6.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"629e1b71bb6419817ed7566c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/629e1b71bb6419817ed7566c/0ZCt-11eQtRDCOk9AozOp.jpeg","isPro":false,"fullname":"Huck Yang","user":"huckiyang","type":"user"},{"_id":"61703fa3dff0ef663e421ab5","avatarUrl":"/avatars/96172e2782e218bbbddfdf47f96c1ad4.svg","isPro":false,"fullname":"Jaehun Jung","user":"Jaehun","type":"user"},{"_id":"67ad9767d3a5cc6789882e10","avatarUrl":"/avatars/e5d16dc828670963ffe3bf2cf33318ab.svg","isPro":false,"fullname":"Kim Jiwan","user":"JiwanKim","type":"user"},{"_id":"6434b6619bd5a84b5dcfa4de","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6434b6619bd5a84b5dcfa4de/h8Q6kPNjFNc03wmdboHzq.jpeg","isPro":true,"fullname":"Young-Jun Lee","user":"passing2961","type":"user"},{"_id":"631e14ac473a6825f285e89d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/631e14ac473a6825f285e89d/K-6QnoeGLg8XFvbTMMdqA.jpeg","isPro":false,"fullname":"Yury Panikov","user":"panikov","type":"user"},{"_id":"686db5d4af2b856fabbf13aa","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/6BjMv2LVNoqvbX8fQSTPI.png","isPro":false,"fullname":"V bbbb","user":"Bbbbbnnn","type":"user"},{"_id":"63c1699e40a26dd2db32400d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63c1699e40a26dd2db32400d/3N0-Zp8igv8-52mXAdiiq.jpeg","isPro":false,"fullname":"Chroma","user":"Chroma111","type":"user"},{"_id":"645abb43c4acfcf6640270e3","avatarUrl":"/avatars/e65f8de9b9fe878bd8e9106a802c6f21.svg","isPro":false,"fullname":"David Acuna","user":"davidjesusacu","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0,"organization":{"_id":"60262b67268c201cdc8b7d43","name":"nvidia","fullname":"NVIDIA","avatar":"https://cdn-uploads.huggingface.co/production/uploads/1613114437487-60262a8e0703121c822a80b6.png"}}">
AI-generated summary
A new reasoning data generation framework creates a large-scale vision-centric dataset with over 1M synthetic questions, enhancing performance across various benchmarks and improving cross-modality transfer.
Abstract
Recent progress in multimodal reasoning has been driven largely by undisclosed datasets and proprietary data synthesis recipes, leaving open questions about how to systematically build large-scale, vision-centric reasoning datasets, particularly for tasks that go beyond visual math. In this work, we introduce a new reasoning data generation framework spanning diverse skills and levels of complexity, with over 1M high-quality synthetic vision-centric questions. The dataset also includes preference data and instruction prompts supporting both offline and online RL. Our synthesis framework proceeds in two stages: (1) scale and (2) complexity. Reasoning traces are then synthesized through a two-stage process that leverages VLMs and reasoning LLMs, producing CoT traces for VLMs that capture the richness and diverse cognitive behaviors found in frontier reasoning models. Remarkably, we show that finetuning Qwen2.5-VL-7B on our data outperforms all open-data baselines across all evaluated vision-centric benchmarks, and even surpasses strong closed-data models such as MiMo-VL-7B-RL on V* Bench, CV-Bench, and MMStar-V. Perhaps most surprisingly, despite being entirely vision-centric, our data transfers positively to text-only reasoning (MMLU-Pro) and audio reasoning (MMAU), demonstrating its effectiveness. Similarly, despite not containing videos or embodied visual data, we observe notable gains when evaluating on a single-evidence embodied QA benchmark (NiEH). Finally, we use our data to analyze the entire VLM post-training pipeline. Our empirical analysis highlights that (i) SFT on high-quality data with non-linear reasoning traces is essential for effective online RL, (ii) staged offline RL matches online RL's performance while reducing compute demands, and (iii) careful SFT on high-quality data can substantially improve out-of-domain, cross-modality transfer.
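To make the abstract's two-stage recipe (scale, then complexity, followed by two-stage trace synthesis with a VLM and a reasoning LLM) easier to picture, here is a minimal sketch in Python. Everything concrete in it is an assumption: the prompts, the "Q:/A:" output format, the VisionLanguageModel/ReasoningLLM interfaces, and the answer-matching filter are illustrative placeholders, not the authors' released pipeline.

```python
# Illustrative sketch of the two-stage question synthesis plus two-stage trace
# distillation described in the abstract. All interfaces, prompts, and helper
# names below are hypothetical; only the data flow mirrors the paper's description.
from dataclasses import dataclass
from typing import List, Protocol, Tuple


class VisionLanguageModel(Protocol):
    # Assumed interface: any VLM client that answers a text prompt about an image.
    def generate(self, image: str, prompt: str) -> str: ...


class ReasoningLLM(Protocol):
    # Assumed interface: a text-only reasoning model returning (trace, final_answer).
    def reason(self, question: str, context: str) -> Tuple[str, str]: ...


@dataclass
class Example:
    image_path: str
    question: str
    answer: str
    cot_trace: str = ""


def generate_base_questions(image_path: str, vlm: VisionLanguageModel) -> List[Example]:
    """Stage 1 (scale): ask a VLM for many simple, verifiable questions per image."""
    raw = vlm.generate(
        image=image_path,
        prompt="List simple questions about objects, attributes, counts, and spatial "
               "relations, one per line as 'Q: ... | A: ...'.",
    )
    examples = []
    for line in raw.splitlines():
        if "| A:" in line:
            q, a = line.split("| A:", 1)
            examples.append(Example(image_path, q.replace("Q:", "").strip(), a.strip()))
    return examples


def compose_harder_question(seeds: List[Example], composer: VisionLanguageModel) -> Example:
    """Stage 2 (complexity): combine several simple grounded QA pairs about the same
    image into one question that requires multi-step, compositional reasoning."""
    facts = "\n".join(f"Q: {s.question} A: {s.answer}" for s in seeds)
    prompt = ("Combine the facts below into ONE question that needs several reasoning "
              "steps, formatted as 'QUESTION: ... ANSWER: ...'.\n" + facts)
    raw = composer.generate(image=seeds[0].image_path, prompt=prompt)
    question, _, answer = raw.partition("ANSWER:")
    return Example(seeds[0].image_path, question.replace("QUESTION:", "").strip(), answer.strip())


def distill_trace(ex: Example, grounder: VisionLanguageModel, reasoner: ReasoningLLM) -> Example:
    """Trace synthesis: a VLM first grounds the image in text, then a reasoning LLM
    produces a long CoT; the trace is kept only if it reaches the reference answer."""
    grounding = grounder.generate(image=ex.image_path, prompt="Describe the image in detail.")
    trace, predicted = reasoner.reason(question=ex.question, context=grounding)
    if predicted.strip().lower() == ex.answer.strip().lower():  # naive answer check
        ex.cot_trace = trace
    return ex
```

In practice the grounding step and the verification filter would need to be far more robust than this; the sketch only conveys the data flow: cheap questions at scale, compositional questions built on top of them, then verified long CoT traces distilled from a reasoning model.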
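The post-training findings (SFT on non-linear traces first, then staged offline RL rivaling online RL) suggest a loop of repeated offline preference optimization over the released preference data. The sketch below uses a generic DPO-style objective as one possible instantiation of offline RL on preference pairs; the paper's actual algorithm, staging schedule, and hyperparameters are not specified here, so treat every detail as an assumption.

```python
# Hedged illustration of "staged offline RL": several rounds of
# (sample preference pairs from the current policy -> offline DPO-style update),
# rather than a full online RL loop. Not the authors' implementation.
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective on per-sequence log-probabilities: prefer the chosen
    response over the rejected one, regularized against a frozen reference model."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()


# Hypothetical staged schedule: each stage re-samples preference pairs from the
# current policy (e.g., on progressively harder questions) and optimizes the loss
# above offline before moving to the next stage.
#
# for stage, question_pool in enumerate(stages):                 # assumed data split
#     pairs = build_preference_pairs(policy, question_pool)      # assumed helper
#     reference = freeze_copy(policy)                            # assumed helper
#     train_offline(policy, reference, pairs, loss_fn=dpo_loss)  # assumed helper
```

The appeal of such a staged offline scheme, as the abstract's finding (ii) suggests, is that it avoids the generation-in-the-loop cost of online RL while still refreshing the preference data between stages.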