

\n","updatedAt":"2025-05-27T13:02:59.594Z","author":{"_id":"61bb00f6c4ac95d207b25f1b","avatarUrl":"/avatars/3b6eba701d64518d6f694942f5b2e9a9.svg","fullname":"Zongyang Ma","name":"zyma","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"isUserFollowing":false}},"numEdits":2,"identifiedLanguage":{"language":"en","probability":0.8486011624336243},"editors":["zyma"],"editorAvatarUrls":["/avatars/3b6eba701d64518d6f694942f5b2e9a9.svg"],"reactions":[],"isReport":false}},{"id":"6836697b966c664cb545a67a","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2025-05-28T01:40:11.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning](https://huggingface.co/papers/2505.17022) (2025)\n* [UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning](https://huggingface.co/papers/2505.14231) (2025)\n* [DeepEyes: Incentivizing\"Thinking with Images\"via Reinforcement Learning](https://huggingface.co/papers/2505.14362) (2025)\n* [VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning](https://huggingface.co/papers/2505.12434) (2025)\n* [G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning](https://huggingface.co/papers/2505.13426) (2025)\n* [GMAI-VL-R1: Harnessing Reinforcement Learning for Multimodal Medical Reasoning](https://huggingface.co/papers/2504.01886) (2025)\n* [VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model](https://huggingface.co/papers/2504.07615) (2025)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

\n

The following papers were recommended by the Semantic Scholar API

\n\n

Please give a thumbs up to this comment if you found it helpful!

\n

If you want recommendations for any Paper on Hugging Face checkout this Space

\n

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend

\n","updatedAt":"2025-05-28T01:40:11.758Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7182655334472656},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2505.15804","authors":[{"_id":"6835adccc9704b79b7f18dd0","user":{"_id":"65def3cfe8e604a7b6f39681","avatarUrl":"/avatars/c1ce791f5513d934c5f6426bd17e4fbd.svg","isPro":false,"fullname":"lzz","user":"lizongzhao","type":"user"},"name":"Zongzhao Li","status":"admin_assigned","statusLastChangedAt":"2025-05-27T12:49:37.146Z","hidden":false},{"_id":"6835adccc9704b79b7f18dd1","user":{"_id":"61bb00f6c4ac95d207b25f1b","avatarUrl":"/avatars/3b6eba701d64518d6f694942f5b2e9a9.svg","isPro":false,"fullname":"Zongyang Ma","user":"zyma","type":"user"},"name":"Zongyang Ma","status":"admin_assigned","statusLastChangedAt":"2025-05-27T12:49:26.103Z","hidden":false},{"_id":"6835adccc9704b79b7f18dd2","name":"Mingze Li","hidden":false},{"_id":"6835adccc9704b79b7f18dd3","user":{"_id":"67d92d6061f369297a4a225a","avatarUrl":"/avatars/af71c847d8122c827b22ce52d4c5af71.svg","isPro":false,"fullname":"Songyou Li","user":"Wthinker","type":"user"},"name":"Songyou Li","status":"admin_assigned","statusLastChangedAt":"2025-05-27T12:49:52.377Z","hidden":false},{"_id":"6835adccc9704b79b7f18dd4","user":{"_id":"642eecbf9b2484d7d8526781","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/642eecbf9b2484d7d8526781/4IvGbd66s49Wx5pZyZGHA.png","isPro":false,"fullname":"Yu Rong","user":"Swrooy","type":"user"},"name":"Yu Rong","status":"claimed_verified","statusLastChangedAt":"2025-06-10T09:29:51.963Z","hidden":false},{"_id":"6835adccc9704b79b7f18dd5","user":{"_id":"67a5a25269f568c7eb4173cd","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/IFzcHm_K8s2UxTRCC79Xf.png","isPro":false,"fullname":"Tingyang Xu","user":"xuty007","type":"user"},"name":"Tingyang Xu","status":"admin_assigned","statusLastChangedAt":"2025-05-27T12:49:58.919Z","hidden":false},{"_id":"6835adccc9704b79b7f18dd6","name":"Ziqi Zhang","hidden":false},{"_id":"6835adccc9704b79b7f18dd7","name":"Deli Zhao","hidden":false},{"_id":"6835adccc9704b79b7f18dd8","name":"Wenbing Huang","hidden":false}],"publishedAt":"2025-05-21T17:57:38.000Z","submittedOnDailyAt":"2025-05-27T11:11:13.707Z","title":"STAR-R1: Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs","submittedOnDailyBy":{"_id":"61bb00f6c4ac95d207b25f1b","avatarUrl":"/avatars/3b6eba701d64518d6f694942f5b2e9a9.svg","isPro":false,"fullname":"Zongyang Ma","user":"zyma","type":"user"},"summary":"Multimodal Large Language Models (MLLMs) have demonstrated remarkable\ncapabilities across diverse tasks, yet they lag significantly behind humans in\nspatial reasoning. We investigate this gap through Transformation-Driven Visual\nReasoning (TVR), a challenging task requiring identification of object\ntransformations across images under varying viewpoints. 
While traditional\nSupervised Fine-Tuning (SFT) fails to generate coherent reasoning paths in\ncross-view settings, sparse-reward Reinforcement Learning (RL) suffers from\ninefficient exploration and slow convergence. To address these limitations, we\npropose STAR-R1, a novel framework that integrates a single-stage RL paradigm\nwith a fine-grained reward mechanism tailored for TVR. Specifically, STAR-R1\nrewards partial correctness while penalizing excessive enumeration and passive\ninaction, enabling efficient exploration and precise reasoning. Comprehensive\nevaluations demonstrate that STAR-R1 achieves state-of-the-art performance\nacross all 11 metrics, outperforming SFT by 23% in cross-view scenarios.\nFurther analysis reveals STAR-R1's anthropomorphic behavior and highlights its\nunique ability to compare all objects for improving spatial reasoning. Our work\nprovides critical insights in advancing the research of MLLMs and reasoning\nmodels. The codes, model weights, and data will be publicly available at\nhttps://github.com/zongzhao23/STAR-R1.","upvotes":10,"discussionId":"6835adcdc9704b79b7f18e23","githubRepo":"https://github.com/zongzhao23/STAR-R1","githubRepoAddedBy":"user","ai_summary":"STAR-R1, a novel RL framework with a fine-grained reward mechanism, enhances spatial reasoning in multimodal large language models by addressing limitations in traditional SFT and sparse-reward RL.","ai_keywords":["Transformation-Driven Visual Reasoning","Supervised Fine-Tuning","Reinforcement Learning","single-stage RL","fine-grained reward mechanism","partial correctness","excessive enumeration","spatial reasoning","multimodal large language models"],"githubStars":11},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"61bb00f6c4ac95d207b25f1b","avatarUrl":"/avatars/3b6eba701d64518d6f694942f5b2e9a9.svg","isPro":false,"fullname":"Zongyang Ma","user":"zyma","type":"user"},{"_id":"670f86e4d75f114352916a35","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/g3zZ6TTbgV-Xu789lPaYN.png","isPro":false,"fullname":"Li","user":"zongzhao","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"648eb1eb59c4e5c87dc116e0","avatarUrl":"/avatars/c636cea39c2c0937f01398c94ead5dad.svg","isPro":false,"fullname":"fdsqefsgergd","user":"T-representer","type":"user"},{"_id":"669f3b098c65c172c4d64039","avatarUrl":"/avatars/d85158964853ab87b9b677fa16df90f8.svg","isPro":false,"fullname":"Yuxin Chen","user":"Uasonchen","type":"user"},{"_id":"66b091f3bfb4316422b07303","avatarUrl":"/avatars/b226027ca9aa4ffc734c891db46e3621.svg","isPro":false,"fullname":"Sophia Wilson","user":"AnnaCute","type":"user"},{"_id":"67f71617aa000433e0ecf837","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/CBmhQWqu6fCxLb7Eomb9o.png","isPro":false,"fullname":"zeronine","user":"zero9labs","type":"user"},{"_id":"6358edff3b3638bdac83f7ac","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1666772404424-noauth.jpeg","isPro":false,"fullname":"Pratyay Banerjee","user":"Neilblaze","type":"user"},{"_id":"68410e4e5de370509548f0ed","avatarUrl":"/avatars/ec53b49b3de495003a0d05e137586326.svg","isPro":false,"fullname":"Jiacheng 
Cen","user":"Chewxq","type":"user"},{"_id":"630f22776f75a5f478013e2b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/630f22776f75a5f478013e2b/RqgUE1BR8m6AlmATqSVPu.jpeg","isPro":false,"fullname":"Yura Choi","user":"Yuuraa","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
Papers
arxiv:2505.15804

STAR-R1: Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs

Published on May 21, 2025 · Submitted by Zongyang Ma on May 27, 2025
Authors: Zongzhao Li, Zongyang Ma, Mingze Li, Songyou Li, Yu Rong, Tingyang Xu, Ziqi Zhang, Deli Zhao, Wenbing Huang
Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across diverse tasks, yet they lag significantly behind humans in spatial reasoning. We investigate this gap through Transformation-Driven Visual Reasoning (TVR), a challenging task requiring identification of object transformations across images under varying viewpoints. While traditional Supervised Fine-Tuning (SFT) fails to generate coherent reasoning paths in cross-view settings, sparse-reward Reinforcement Learning (RL) suffers from inefficient exploration and slow convergence. To address these limitations, we propose STAR-R1, a novel framework that integrates a single-stage RL paradigm with a fine-grained reward mechanism tailored for TVR. Specifically, STAR-R1 rewards partial correctness while penalizing excessive enumeration and passive inaction, enabling efficient exploration and precise reasoning. Comprehensive evaluations demonstrate that STAR-R1 achieves state-of-the-art performance across all 11 metrics, outperforming SFT by 23% in cross-view scenarios. Further analysis reveals STAR-R1's anthropomorphic behavior and highlights its unique ability to compare all objects for improving spatial reasoning. Our work provides critical insights in advancing the research of MLLMs and reasoning models. The codes, model weights, and data will be publicly available at https://github.com/zongzhao23/STAR-R1.

AI-generated summary

STAR-R1, a novel RL framework with a fine-grained reward mechanism, enhances spatial reasoning in multimodal large language models by addressing limitations in traditional SFT and sparse-reward RL.
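As a concrete picture of the reward shaping described in the abstract, here is a minimal sketch of how such a fine-grained TVR reward could be scored. It assumes a set-matching formulation over predicted object transformations; the function name, penalty weights, and string encoding of transformations are hypothetical illustrations, not the actual STAR-R1 implementation.

# Hedged sketch of a fine-grained TVR-style reward: partial credit for each
# correctly identified transformation, a penalty for enumerating more
# transformations than actually occurred, and a penalty for predicting none.
# All names and weights are illustrative, not from the STAR-R1 codebase.

def tvr_reward(predicted: list[str], ground_truth: list[str],
               enum_penalty: float = 0.2, inaction_penalty: float = 0.5) -> float:
    """Score a predicted set of object transformations against the ground truth."""
    if not predicted:
        # Passive inaction: the model committed to no transformation at all.
        return -inaction_penalty

    gt = set(ground_truth)
    hits = sum(1 for p in predicted if p in gt)

    # Partial correctness: fraction of ground-truth transformations recovered.
    partial = hits / max(len(gt), 1)

    # Excessive enumeration: each prediction beyond the true count is penalized,
    # discouraging the model from spraying guesses to farm partial credit.
    excess = max(0, len(predicted) - len(gt))
    return partial - enum_penalty * excess

# Example: two of three true transformations found, no over-enumeration.
reward = tvr_reward(["cube:color->red", "sphere:size->large", "cylinder:move->left"],
                    ["cube:color->red", "sphere:size->large", "cone:removed"])
print(round(reward, 3))  # 0.667

Under a shaping of this kind, dense partial rewards keep exploration efficient, while the enumeration and inaction terms push the policy away from the two degenerate strategies the abstract calls out.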

Community

Paper author · Paper submitter · edited May 27, 2025

📖Paper
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across diverse tasks, yet they lag significantly behind humans in spatial reasoning. We investigate this gap through Transformation-Driven Visual Reasoning (TVR), a challenging task requiring identification of object transformations across images under varying viewpoints. While traditional Supervised Fine-Tuning (SFT) fails to generate coherent reasoning paths in cross-view settings, sparse-reward Reinforcement Learning (RL) suffers from inefficient exploration and slow convergence. To address these limitations, we propose STAR-R1, a novel framework that integrates a single-stage RL paradigm with a fine-grained reward mechanism tailored for TVR. Specifically, STAR-R1 rewards partial correctness while penalizing excessive enumeration and passive inaction, enabling efficient exploration and precise reasoning. Comprehensive evaluations demonstrate that STAR-R1 achieves state-of-the-art performance across all 11 metrics, outperforming SFT by 23% in cross-view scenarios. Further analysis reveals STAR-R1's anthropomorphic behavior and highlights its unique ability to compare all objects for improving spatial reasoning. Our work provides critical insights in advancing the research of MLLMs and reasoning models. The codes, model weights, and data will be publicly available at https://github.com/zongzhao23/STAR-R1.

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

- GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning (https://huggingface.co/papers/2505.17022) (2025)
- UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning (https://huggingface.co/papers/2505.14231) (2025)
- DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning (https://huggingface.co/papers/2505.14362) (2025)
- VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning (https://huggingface.co/papers/2505.12434) (2025)
- G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning (https://huggingface.co/papers/2505.13426) (2025)
- GMAI-VL-R1: Harnessing Reinforcement Learning for Multimodal Medical Reasoning (https://huggingface.co/papers/2504.01886) (2025)
- VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model (https://huggingface.co/papers/2504.07615) (2025)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2505.15804 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2505.15804 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2505.15804 in a Space README.md to link it from this page.

Collections including this paper 2