SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards
Project page: https://hunarbatra.com/SpatialThinker/
arXiv: https://arxiv.org/abs/2511.07403
GitHub: https://github.com/hunarbatra/SpatialThinker
\n","updatedAt":"2025-11-18T01:35:40.602Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7099432349205017},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2511.07403","authors":[{"_id":"69140683ac231a5726572002","user":{"_id":"62f5c24eea5bd6b1abc8e151","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1660273191881-noauth.jpeg","isPro":false,"fullname":"Hunar Batra","user":"hunarbatra","type":"user"},"name":"Hunar Batra","status":"claimed_verified","statusLastChangedAt":"2025-11-12T12:18:58.669Z","hidden":false},{"_id":"69140683ac231a5726572003","name":"Haoqin Tu","hidden":false},{"_id":"69140683ac231a5726572004","name":"Hardy Chen","hidden":false},{"_id":"69140683ac231a5726572005","name":"Yuanze Lin","hidden":false},{"_id":"69140683ac231a5726572006","name":"Cihang Xie","hidden":false},{"_id":"69140683ac231a5726572007","name":"Ronald Clark","hidden":false}],"publishedAt":"2025-11-10T18:52:47.000Z","submittedOnDailyAt":"2025-11-17T04:00:22.549Z","title":"SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards","submittedOnDailyBy":{"_id":"62f5c24eea5bd6b1abc8e151","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1660273191881-noauth.jpeg","isPro":false,"fullname":"Hunar Batra","user":"hunarbatra","type":"user"},"summary":"Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language tasks, but they continue to struggle with spatial understanding. Existing spatial MLLMs often rely on explicit 3D inputs or architecture-specific modifications, and remain constrained by large-scale datasets or sparse supervision. To address these limitations, we introduce SpatialThinker, a 3D-aware MLLM trained with RL to integrate structured spatial grounding with multi-step reasoning. The model simulates human-like spatial perception by constructing a scene graph of task-relevant objects and spatial relations, and reasoning towards an answer via dense spatial rewards. SpatialThinker consists of two key contributions: (1) a data synthesis pipeline that generates STVQA-7K, a high-quality spatial VQA dataset, and (2) online RL with a multi-objective dense spatial reward enforcing spatial grounding. SpatialThinker-7B outperforms supervised fine-tuning and the sparse RL baseline on spatial understanding and real-world VQA benchmarks, nearly doubling the base-model gain compared to sparse RL, and surpassing GPT-4o. 
These results showcase the effectiveness of combining spatial supervision with reward-aligned reasoning in enabling robust 3D spatial understanding with limited data and advancing MLLMs towards human-level visual reasoning.","upvotes":14,"discussionId":"69140684ac231a5726572008","projectPage":"https://hunarbatra.com/SpatialThinker/","githubRepo":"https://github.com/hunarbatra/SpatialThinker","githubRepoAddedBy":"auto","ai_summary":"SpatialThinker, a 3D-aware MLLM trained with RL, enhances spatial understanding by integrating structured spatial grounding and multi-step reasoning, outperforming existing models on spatial VQA and real-world benchmarks.","ai_keywords":["MLLMs","spatial understanding","3D inputs","architecture-specific modifications","large-scale datasets","sparse supervision","SpatialThinker","scene graph","spatial rewards","data synthesis pipeline","STVQA-7K","online RL","multi-objective dense spatial reward","spatial grounding","reward-aligned reasoning","visual reasoning"],"githubStars":29,"organization":{"_id":"690ddf6f05d3fca3614552a7","name":"OX-PIXL","fullname":"Perceptual Intelligence and Extended Reality Lab","avatar":"https://cdn-uploads.huggingface.co/production/uploads/6305ee63d70693fdf1c7dbb8/6pCxseWIvO9rzU5h5nFPr.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"63834de659424581c35c42ae","avatarUrl":"/avatars/28d7e686010a4a570ce3e77698df448c.svg","isPro":false,"fullname":"Miguel Farinha","user":"mlfarinha","type":"user"},{"_id":"6305ee63d70693fdf1c7dbb8","avatarUrl":"/avatars/0e81ed3757b4e65be82063b538c3fe49.svg","isPro":false,"fullname":"Ronald Clark","user":"r0nn13","type":"user"},{"_id":"62f5c24eea5bd6b1abc8e151","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1660273191881-noauth.jpeg","isPro":false,"fullname":"Hunar Batra","user":"hunarbatra","type":"user"},{"_id":"668e16bbf71499ffa7a5133c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/668e16bbf71499ffa7a5133c/CSHeMp_zWRpLm-jtNt8OG.jpeg","isPro":false,"fullname":"Jerred Chen","user":"jerredchen00","type":"user"},{"_id":"662edf5f04f9341b56fa8a81","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/AucZp4zy0w7c7EjtVYujE.jpeg","isPro":false,"fullname":"Ananay arora","user":"ananayarora","type":"user"},{"_id":"684178483b8c48d81babcbf5","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/684178483b8c48d81babcbf5/cFmEFnm_CPaqxb7yw17zv.jpeg","isPro":false,"fullname":"Yuanze Lin","user":"YuanzeLin","type":"user"},{"_id":"60aef0fbee40717d1a8fa6a5","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1624676266012-60aef0fbee40717d1a8fa6a5.png","isPro":false,"fullname":"Mayank Bhaskar","user":"cataluna84","type":"user"},{"_id":"64567d1a4a7ffb7d5a492e93","avatarUrl":"/avatars/c5b5a8f4c8e3fea3e69bfb4908d544a8.svg","isPro":false,"fullname":"Jacob Lin","user":"jacoblin","type":"user"},{"_id":"684d57f26e04c265777ead3f","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/cuOj-bQqukSZreXgUJlfm.png","isPro":false,"fullname":"Joakim 
Lee","user":"Reinforcement4All","type":"user"},{"_id":"645eb61da3c5cd8a16efffff","avatarUrl":"/avatars/9112bfeed598dfabf9e077e69e09ecc9.svg","isPro":false,"fullname":"Cihang Xie","user":"cihangxie","type":"user"},{"_id":"68a74579646ad764c5cafd71","avatarUrl":"/avatars/3cb44f4373d708e5c397dd4814a716eb.svg","isPro":false,"fullname":"MihailSlutsky","user":"MihailSlutsky","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0,"organization":{"_id":"690ddf6f05d3fca3614552a7","name":"OX-PIXL","fullname":"Perceptual Intelligence and Extended Reality Lab","avatar":"https://cdn-uploads.huggingface.co/production/uploads/6305ee63d70693fdf1c7dbb8/6pCxseWIvO9rzU5h5nFPr.png"}}">
SpatialThinker, a 3D-aware MLLM trained with RL, enhances spatial understanding by integrating structured spatial grounding and multi-step reasoning, outperforming existing models on spatial VQA and real-world benchmarks.
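The abstract describes the model constructing a scene graph of task-relevant objects and spatial relations before reasoning to an answer. As a rough, purely illustrative sketch (the actual STVQA-7K schema is defined in the paper and repository; every field name and value below is invented), such a scene-graph-grounded VQA sample could be represented as:

```python
# Hypothetical illustration of a scene-graph-grounded spatial VQA sample.
# Field names and values are invented for clarity; see the paper/repo for
# the actual STVQA-7K schema.
sample = {
    "image": "kitchen_001.jpg",
    "question": "Is the mug to the left of the laptop from the camera's viewpoint?",
    "scene_graph": {
        "objects": [
            {"id": 0, "label": "mug",    "bbox": [120, 340, 180, 410]},   # [x1, y1, x2, y2]
            {"id": 1, "label": "laptop", "bbox": [260, 300, 520, 470]},
        ],
        "relations": [
            {"subject": 0, "predicate": "left of", "object": 1},
        ],
    },
    "answer": "Yes",
}
```

Grounding answers in explicit object boxes and relation triples of this kind is what a dense spatial reward can score against during RL.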
We introduce SpatialThinker, a 3D-aware reasoning MLLM trained via RL with dense spatial rewards on STVQA-7K, a 7K-sample synthetic spatial VQA dataset we generate. SpatialThinker achieves 2x the gains of vanilla RL and surpasses GPT-4o on several tasks.
- SpatialThinker integrates scene-graph-based grounding with online RL for spatial reasoning, achieving strong performance with only 7K training samples versus the millions required by existing methods.
- We introduce STVQA-7K, a high-quality spatial VQA dataset grounded in scene graphs, along with a scalable data-generation pipeline that can produce up to 108K samples.
- We design a dense, lexicographically gated multi-objective reward that guides regionally focused spatial reasoning (see the sketch below), achieving superior in- and out-of-distribution generalization across spatial, generic VQA, and real-world benchmarks, and outperforming conventional RL and SFT baselines, open-source generalist and spatial MLLMs, and proprietary models.
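To unpack the last bullet: "lexicographically gated" means lower-priority reward terms only contribute once higher-priority ones are satisfied, so a rollout cannot collect answer credit without first producing well-formed, spatially grounded output. The Python sketch below illustrates only that gating pattern; the component names, weights, and thresholds are hypothetical and do not reproduce SpatialThinker's actual reward definition.

```python
def lexicographic_spatial_reward(fmt_ok: bool,
                                 grounding_iou: float,
                                 answer_correct: bool,
                                 iou_gate: float = 0.5) -> float:
    """Toy sketch of a lexicographically gated multi-objective reward.

    Higher-priority objectives gate lower-priority ones: a response earns
    grounding credit only if it is well-formatted, and answer credit only if
    its spatial grounding clears the IoU gate. All weights, gates, and
    component definitions here are illustrative, not the paper's.
    """
    reward = 0.0
    if not fmt_ok:                      # 1) structured-output / format check
        return reward
    reward += 0.2
    reward += 0.3 * grounding_iou       # 2) dense spatial grounding term (e.g. box IoU)
    if grounding_iou < iou_gate:        # gate: no answer credit without grounding
        return reward
    if answer_correct:                  # 3) final-answer accuracy
        reward += 0.5
    return reward

# Example: well-formatted, decent grounding, correct answer -> near-full reward
print(lexicographic_spatial_reward(True, 0.7, True))   # 0.2 + 0.21 + 0.5 = 0.91
```

A gate of this form is one way to keep a dense grounding signal from being gamed: a response that skips grounding entirely gets no credit for a lucky correct answer.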