Paper page - SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement

ThinkLite-VL-7B: https://huggingface.co/russwang/ThinkLite-VL-7B
ThinkLite-VL-70K: https://huggingface.co/datasets/russwang/ThinkLite-VL-70k
ThinkLite-VL-Hard-11K: https://huggingface.co/datasets/russwang/ThinkLite-VL-Hard-11k
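For reference, below is a minimal loading sketch, not the authors' released code. It assumes the ThinkLite-VL-7B checkpoint follows the Qwen2.5-VL format (its base model is Qwen2.5-VL-7B-Instruct) and that the two datasets load with the standard `datasets` API; verify the exact class and splits against the repository cards.

```python
# Hypothetical loading sketch: assumes a Qwen2.5-VL-compatible checkpoint and
# standard Hub dataset layouts. Confirm details on the model/dataset cards.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from datasets import load_dataset

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "russwang/ThinkLite-VL-7B",
    torch_dtype="auto",   # use the dtype stored in the checkpoint
    device_map="auto",    # requires `accelerate`
)
processor = AutoProcessor.from_pretrained("russwang/ThinkLite-VL-7B")

# Full 70k training pool and the MCTS-filtered 11k hard subset used for RFT.
pool = load_dataset("russwang/ThinkLite-VL-70k")
hard = load_dataset("russwang/ThinkLite-VL-Hard-11k")
```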

\n","updatedAt":"2025-04-11T03:27:16.481Z","author":{"_id":"655fed9fdef5905d38b84af3","avatarUrl":"/avatars/2cda4182dfd11a1e94743639e62328ea.svg","fullname":"Xiyao Wang","name":"russwang","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":6,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.23836512863636017},"editors":["russwang"],"editorAvatarUrls":["/avatars/2cda4182dfd11a1e94743639e62328ea.svg"],"reactions":[],"isReport":false}},{"id":"67f9c320d81ece8dc1affdd0","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2025-04-12T01:34:24.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models](https://huggingface.co/papers/2503.06749) (2025)\n* [OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement](https://huggingface.co/papers/2503.17352) (2025)\n* [Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1](https://huggingface.co/papers/2503.24376) (2025)\n* [R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning](https://huggingface.co/papers/2503.05592) (2025)\n* [MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning](https://huggingface.co/papers/2503.07365) (2025)\n* [Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering](https://huggingface.co/papers/2503.11197) (2025)\n* [UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning](https://huggingface.co/papers/2503.21620) (2025)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

\n

The following papers were recommended by the Semantic Scholar API

\n\n

Please give a thumbs up to this comment if you found it helpful!

\n

If you want recommendations for any Paper on Hugging Face checkout this Space

\n

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend

\n","updatedAt":"2025-04-12T01:34:24.868Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7229174971580505},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2504.07934","authors":[{"_id":"67f88bbaf1410163f7c3b68a","user":{"_id":"655fed9fdef5905d38b84af3","avatarUrl":"/avatars/2cda4182dfd11a1e94743639e62328ea.svg","isPro":false,"fullname":"Xiyao Wang","user":"russwang","type":"user"},"name":"Xiyao Wang","status":"admin_assigned","statusLastChangedAt":"2025-04-11T08:32:46.460Z","hidden":false},{"_id":"67f88bbaf1410163f7c3b68b","user":{"_id":"630713411801ecc7d2592a7c","avatarUrl":"/avatars/fb36f69f03421c3a2a7f72ba0858fa60.svg","isPro":true,"fullname":"Zhengyuan Yang","user":"zyang39","type":"user"},"name":"Zhengyuan Yang","status":"admin_assigned","statusLastChangedAt":"2025-04-11T08:32:53.708Z","hidden":false},{"_id":"67f88bbaf1410163f7c3b68c","user":{"_id":"645ab0b7c266796265baefa4","avatarUrl":"/avatars/bdac661996b63c4b2a56881707afa01f.svg","isPro":false,"fullname":"Chao Feng","user":"chfeng","type":"user"},"name":"Chao Feng","status":"claimed_verified","statusLastChangedAt":"2025-04-13T19:25:33.292Z","hidden":false},{"_id":"67f88bbaf1410163f7c3b68d","name":"Hongjin Lu","hidden":false},{"_id":"67f88bbaf1410163f7c3b68e","user":{"_id":"63db16fff03c3d71ef397206","avatarUrl":"/avatars/bfb7e0d730b7d03302799d5d2828d97d.svg","isPro":false,"fullname":"Linjie Li","user":"linjieli222","type":"user"},"name":"Linjie Li","status":"admin_assigned","statusLastChangedAt":"2025-04-11T08:33:27.401Z","hidden":false},{"_id":"67f88bbaf1410163f7c3b68f","user":{"_id":"6669f406d21cf7d39b1d98ba","avatarUrl":"/avatars/5ec270cb8bcb68786379d8a8bc411aaa.svg","isPro":false,"fullname":"CC Lin","user":"cclin10","type":"user"},"name":"Chung-Ching Lin","status":"claimed_verified","statusLastChangedAt":"2025-07-23T08:38:09.178Z","hidden":false},{"_id":"67f88bbaf1410163f7c3b690","user":{"_id":"6298fd95b58e71e2ac9f3ad8","avatarUrl":"/avatars/7d34644d537bc5c17cf1e4ce4095355c.svg","isPro":false,"fullname":"Kevin Lin","user":"kevinlin311tw","type":"user"},"name":"Kevin Lin","status":"admin_assigned","statusLastChangedAt":"2025-04-11T08:33:58.928Z","hidden":false},{"_id":"67f88bbaf1410163f7c3b691","user":{"_id":"64cbc3e2a257a3212c00a115","avatarUrl":"/avatars/836e61be4aeda2080ddf2db9f2626cc6.svg","isPro":false,"fullname":"Furong Huang Lab at UMD","user":"furongh-lab","type":"user"},"name":"Furong Huang","status":"admin_assigned","statusLastChangedAt":"2025-04-11T08:34:07.020Z","hidden":false},{"_id":"67f88bbaf1410163f7c3b692","user":{"_id":"6413521d4e5305c14f22e110","avatarUrl":"/avatars/a6f8d0573e678f79bc3c0b7897b818ce.svg","isPro":false,"fullname":"Lijuan Wang","user":"Lijuan","type":"user"},"name":"Lijuan Wang","status":"admin_assigned","statusLastChangedAt":"2025-04-11T08:34:25.295Z","hidden":false}],"publishedAt":"2025-04-10T17:49:05.000Z","submittedOnDailyAt":"2025-04-11T01:57:16.470Z","title":"SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual\n Reasoning 
Self-Improvement","submittedOnDailyBy":{"_id":"655fed9fdef5905d38b84af3","avatarUrl":"/avatars/2cda4182dfd11a1e94743639e62328ea.svg","isPro":false,"fullname":"Xiyao Wang","user":"russwang","type":"user"},"summary":"In this paper, we present an effective method to enhance visual reasoning\nwith significantly fewer training samples, relying purely on self-improvement\nwith no knowledge distillation. Our key insight is that the difficulty of\ntraining data during reinforcement fine-tuning (RFT) is critical. Appropriately\nchallenging samples can substantially boost reasoning capabilities even when\nthe dataset is small. Despite being intuitive, the main challenge remains in\naccurately quantifying sample difficulty to enable effective data filtering. To\nthis end, we propose a novel way of repurposing Monte Carlo Tree Search (MCTS)\nto achieve that. Starting from our curated 70k open-source training samples, we\nintroduce an MCTS-based selection method that quantifies sample difficulty\nbased on the number of iterations required by the VLMs to solve each problem.\nThis explicit step-by-step reasoning in MCTS enforces the model to think longer\nand better identifies samples that are genuinely challenging. We filter and\nretain 11k samples to perform RFT on Qwen2.5-VL-7B-Instruct, resulting in our\nfinal model, ThinkLite-VL. Evaluation results on eight benchmarks show that\nThinkLite-VL improves the average performance of Qwen2.5-VL-7B-Instruct by 7%,\nusing only 11k training samples with no knowledge distillation. This\nsignificantly outperforms all existing 7B-level reasoning VLMs, and our fairly\ncomparable baselines that use classic selection methods such as accuracy-based\nfiltering. Notably, on MathVista, ThinkLite-VL-7B achieves the SoTA accuracy of\n75.1, surpassing Qwen2.5-VL-72B, GPT-4o, and O1. 
Our code, data, and model are\navailable at https://github.com/si0wang/ThinkLite-VL.","upvotes":21,"discussionId":"67f88bbbf1410163f7c3b6f4","githubRepo":"https://github.com/si0wang/ThinkLite-VL","githubRepoAddedBy":"user","ai_summary":"A method using Monte Carlo Tree Search for selecting challenging training samples enhances visual reasoning capabilities in a smaller dataset compared to existing models.","ai_keywords":["reinforcement fine-tuning","Monte Carlo Tree Search","sample difficulty","VLMs","visual reasoning","Qwen2.5-VL-7B-Instruct","ThinkLite-VL","MathVista","SoTA accuracy"],"githubStars":107},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"655fed9fdef5905d38b84af3","avatarUrl":"/avatars/2cda4182dfd11a1e94743639e62328ea.svg","isPro":false,"fullname":"Xiyao Wang","user":"russwang","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"630713411801ecc7d2592a7c","avatarUrl":"/avatars/fb36f69f03421c3a2a7f72ba0858fa60.svg","isPro":true,"fullname":"Zhengyuan Yang","user":"zyang39","type":"user"},{"_id":"64887eb15cf73a16e767b56a","avatarUrl":"/avatars/ada2b6a07346b1d61322ddd04d219318.svg","isPro":false,"fullname":"Yuhang Zhou","user":"zyhang1998","type":"user"},{"_id":"62505101a0f6b0ed18114323","avatarUrl":"/avatars/fdd0cd6abba33740b037b71876e8af41.svg","isPro":false,"fullname":"Paiheng Xu","user":"paiheng","type":"user"},{"_id":"657152eb12f162153b50ec9d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/657152eb12f162153b50ec9d/qnldHP35PclV0pDz_05q8.jpeg","isPro":false,"fullname":"Byung-Kwan Lee","user":"BK-Lee","type":"user"},{"_id":"61b3576de49318df54457d8f","avatarUrl":"/avatars/5d425bea6ee6319d413e965b4499ec5c.svg","isPro":false,"fullname":"Biziel","user":"Grzegorz","type":"user"},{"_id":"6669f406d21cf7d39b1d98ba","avatarUrl":"/avatars/5ec270cb8bcb68786379d8a8bc411aaa.svg","isPro":false,"fullname":"CC Lin","user":"cclin10","type":"user"},{"_id":"645ab0b7c266796265baefa4","avatarUrl":"/avatars/bdac661996b63c4b2a56881707afa01f.svg","isPro":false,"fullname":"Chao Feng","user":"chfeng","type":"user"},{"_id":"637bb90c1f3d18b9ce7d1c5b","avatarUrl":"/avatars/a64c605a73b980da801c521b862026f4.svg","isPro":false,"fullname":"John Gil Cubas","user":"JohnRoger","type":"user"},{"_id":"6751ae913f59c62f77583757","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/_dCyD7vTFPlFETR97WyQj.png","isPro":false,"fullname":"Ivy Zhang","user":"Ivy1997","type":"user"},{"_id":"6270324ebecab9e2dcf245de","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6270324ebecab9e2dcf245de/cMbtWSasyNlYc9hvsEEzt.jpeg","isPro":false,"fullname":"Kye Gomez","user":"kye","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
Papers
arxiv:2504.07934

SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement

Published on Apr 10, 2025 · Submitted by Xiyao Wang on Apr 11, 2025

Abstract

An MCTS-based method for selecting appropriately challenging training samples improves visual reasoning while using far less training data than existing approaches.

AI-generated summary

In this paper, we present an effective method to enhance visual reasoning with significantly fewer training samples, relying purely on self-improvement with no knowledge distillation. Our key insight is that the difficulty of training data during reinforcement fine-tuning (RFT) is critical: appropriately challenging samples can substantially boost reasoning capabilities even when the dataset is small. While this insight is intuitive, the main challenge remains accurately quantifying sample difficulty to enable effective data filtering. To this end, we propose a novel way of repurposing Monte Carlo Tree Search (MCTS) to achieve this. Starting from our curated 70k open-source training samples, we introduce an MCTS-based selection method that quantifies sample difficulty by the number of iterations the VLM requires to solve each problem. The explicit step-by-step reasoning in MCTS forces the model to think longer and better identifies samples that are genuinely challenging. We filter and retain 11k samples to perform RFT on Qwen2.5-VL-7B-Instruct, resulting in our final model, ThinkLite-VL. Evaluation results on eight benchmarks show that ThinkLite-VL improves the average performance of Qwen2.5-VL-7B-Instruct by 7%, using only 11k training samples and no knowledge distillation. This significantly outperforms all existing 7B-level reasoning VLMs, as well as fairly comparable baselines that use classic selection methods such as accuracy-based filtering. Notably, on MathVista, ThinkLite-VL-7B achieves a SoTA accuracy of 75.1, surpassing Qwen2.5-VL-72B, GPT-4o, and O1. Our code, data, and model are available at https://github.com/si0wang/ThinkLite-VL.
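The selection criterion described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration rather than the authors' implementation: `vlm.propose_step`, `vlm.rollout`, the answer checker, and all hyperparameters are assumed placeholders. The key idea it shows is that a sample's difficulty is measured by the number of MCTS iterations the VLM needs before it first produces a correct answer, with unsolved samples falling into the hardest bucket.

```python
# Minimal sketch of MCTS-guided difficulty filtering (not the authors' code).
# Assumed interfaces: vlm.propose_step(state, n_candidates) returns candidate
# next reasoning steps, vlm.rollout(state) completes the trace to a final answer.
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state            # partial reasoning trace
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

def uct(node, c=1.4):
    # Standard UCT score; unvisited children are explored first.
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits
    )

def is_correct(pred, answer):
    # Placeholder verifier: normalized exact match. A task-specific answer
    # checker would be used in practice.
    return str(pred).strip().lower() == str(answer).strip().lower()

def mcts_iterations_to_solve(question, answer, vlm, max_iters=50):
    """Return the number of MCTS iterations needed to reach a correct answer,
    or max_iters + 1 if the problem is never solved (hardest bucket)."""
    root = Node(state=[question])
    for it in range(1, max_iters + 1):
        # Selection: descend by UCT until reaching a leaf node.
        node = root
        while node.children:
            node = max(node.children, key=uct)
        # Expansion: ask the VLM for candidate next reasoning steps.
        for step in vlm.propose_step(node.state, n_candidates=3):
            node.children.append(Node(state=node.state + [step], parent=node))
        # Simulation: roll out one child to a final answer and score it.
        child = random.choice(node.children)
        final_answer = vlm.rollout(child.state)
        reward = 1.0 if is_correct(final_answer, answer) else 0.0
        if reward == 1.0:
            return it                 # solved at iteration `it`
        # Backpropagation: update visit counts and values up to the root.
        while child is not None:
            child.visits += 1
            child.value += reward
            child = child.parent
    return max_iters + 1

# Filtering: keep only the samples the model finds genuinely hard.
# hard_set = [s for s in samples
#             if mcts_iterations_to_solve(s["question"], s["answer"], vlm) > THRESHOLD]
```

Under this scheme, easy samples are solved in a handful of iterations and are discarded, while samples that require many iterations (or are never solved) are retained for RFT, which is how the 70k pool is reduced to the 11k hard subset described above.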

Community

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

- Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models (2025) https://huggingface.co/papers/2503.06749
- OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement (2025) https://huggingface.co/papers/2503.17352
- Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 (2025) https://huggingface.co/papers/2503.24376
- R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning (2025) https://huggingface.co/papers/2503.05592
- MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning (2025) https://huggingface.co/papers/2503.07365
- Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering (2025) https://huggingface.co/papers/2503.11197
- UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning (2025) https://huggingface.co/papers/2503.21620

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 2

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2504.07934 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2504.07934 in a Space README.md to link it from this page.

Collections including this paper 8