arxiv:2511.22586

Revisiting the Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization

Published on Nov 27, 2025 · Submitted by Yifan Du on Dec 3, 2025
Authors: Yifan Du, Kun Zhou, Yingqian Min, Yue Ling, Wayne Xin Zhao, Youbin Wu

Abstract

We study how different Chain-of-Thought (CoT) designs affect the acquisition of generalizable visual reasoning ability in vision-language models (VLMs). While CoT data, especially long or visual CoT such as "think with image", has been widely used to supervise intermediate reasoning, it remains unclear why specific CoT designs help and which ones truly support generalizable reasoning. To evaluate this systematically, we focus on a controlled maze-solving benchmark where the reasoning rules are fully visual, difficulty can be tuned by grid size, and all intermediate steps can be generated automatically. Using Qwen2.5-VL-7B under a standard SFT-then-RL pipeline, we compare three representative CoT formats: Language CoT, Grounding CoT (with spatial coordinate trajectories), and Visual CoT (with image manipulations). Our experiments reveal that visual and longer CoT mainly accelerate convergence but do not lift the final performance ceiling; concise CoT containing only essential grounding steps outperforms longer traces; and, strikingly, CoT retaining only the minimal grounding results generalizes best across maze sizes. We further validate these insights on other vision-centric tasks. These findings highlight a "short is long" effect and provide practical guidance for constructing more generalizable SFT datasets for visual reasoning.

AI-generated summary

Investigating different Chain-of-Thought designs in vision-language models reveals that concise grounding steps are most effective for improving generalizable visual reasoning across various tasks.
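
To make the setup concrete, here is a minimal sketch in Python (an illustration, not the authors' released code) of the kind of controllable benchmark the abstract describes: a random grid maze whose difficulty scales with grid size, a BFS solver that yields ground-truth intermediate steps, and generators for two of the trace styles the paper contrasts plus a minimal grounding variant. The maze generator, the trace wording, and the "turning points" reading of minimal grounding are all assumptions made for illustration; Visual CoT (image manipulation) is omitted since it cannot be sketched in a few lines.

# Minimal sketch (not the authors' released code) of a controllable maze
# benchmark with automatically generated CoT traces. Maze layout, trace
# wording, and the "turning points" reading of minimal grounding are
# illustrative assumptions.
from collections import deque
import random

def make_maze(n, wall_prob=0.2, seed=0):
    # n x n grid; 0 = open, 1 = wall. Difficulty scales with n.
    rng = random.Random(seed)
    grid = [[1 if rng.random() < wall_prob else 0 for _ in range(n)]
            for _ in range(n)]
    grid[0][0] = grid[n - 1][n - 1] = 0   # keep start and goal open
    return grid

def solve(grid):
    # BFS from (0,0) to (n-1,n-1); returns the ground-truth cell path.
    n = len(grid)
    start, goal = (0, 0), (n - 1, n - 1)
    prev = {start: None}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:                # reconstruct path back to start
            path, cur = [], goal
            while cur is not None:
                path.append(cur)
                cur = prev[cur]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < n and 0 <= nc < n and grid[nr][nc] == 0 \
                    and (nr, nc) not in prev:
                prev[(nr, nc)] = (r, c)
                queue.append((nr, nc))
    return None                            # unsolvable instance

DIRECTION = {(1, 0): "down", (-1, 0): "up", (0, 1): "right", (0, -1): "left"}

def language_cot(path):
    # Verbose Language CoT: one sentence per move.
    return " ".join(
        f"From ({r1},{c1}) I move {DIRECTION[(r2 - r1, c2 - c1)]} to ({r2},{c2})."
        for (r1, c1), (r2, c2) in zip(path, path[1:]))

def grounding_cot(path):
    # Grounding CoT: the full coordinate trajectory.
    return " -> ".join(f"({r},{c})" for r, c in path)

def minimal_grounding(path):
    # Minimal grounding trace: start, turning points, and goal only.
    keep = [path[0]]
    for a, b, c in zip(path, path[1:], path[2:]):
        if (b[0] - a[0], b[1] - a[1]) != (c[0] - b[0], c[1] - b[1]):
            keep.append(b)                 # direction changed at b
    keep.append(path[-1])
    return " -> ".join(f"({r},{c})" for r, c in keep)

if __name__ == "__main__":
    path = None
    for seed in range(100):                # search for a solvable instance
        path = solve(make_maze(5, seed=seed))
        if path:
            break
    assert path is not None, "no solvable maze found"
    print("Language CoT:     ", language_cot(path))
    print("Grounding CoT:    ", grounding_cot(path))
    print("Minimal grounding:", minimal_grounding(path))

Even on a 5x5 maze the length contrast is visible: the Language CoT runs to several sentences, while the minimal grounding trace is a handful of coordinates, which is the regime where the paper reports the best cross-size generalization.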

Community


This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Great work, thanks a lot!


Models citing this paper: 0

No model linking this paper

Cite arxiv.org/abs/2511.22586 in a model README.md to link it from this page.
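
For example, a hypothetical model card could establish the link with a single README.md line that mentions the paper's arXiv URL, e.g. "Fine-tuned following the recipe in https://arxiv.org/abs/2511.22586." The same applies to the dataset and Space sections below.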

Datasets citing this paper: 0

No dataset linking this paper

Cite arxiv.org/abs/2511.22586 in a dataset README.md to link it from this page.

Spaces citing this paper: 0

No Space linking this paper

Cite arxiv.org/abs/2511.22586 in a Space README.md to link it from this page.

Collections including this paper: 3