SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards
Project page: https://hunarbatra.com/SpatialThinker/
arXiv: https://arxiv.org/abs/2511.07403
GitHub: https://github.com/hunarbatra/SpatialThinker
\n","updatedAt":"2025-11-18T01:35:40.602Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7099432349205017},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2511.07403","authors":[{"_id":"69140683ac231a5726572002","user":{"_id":"62f5c24eea5bd6b1abc8e151","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1660273191881-noauth.jpeg","isPro":false,"fullname":"Hunar Batra","user":"hunarbatra","type":"user"},"name":"Hunar Batra","status":"claimed_verified","statusLastChangedAt":"2025-11-12T12:18:58.669Z","hidden":false},{"_id":"69140683ac231a5726572003","name":"Haoqin Tu","hidden":false},{"_id":"69140683ac231a5726572004","name":"Hardy Chen","hidden":false},{"_id":"69140683ac231a5726572005","name":"Yuanze Lin","hidden":false},{"_id":"69140683ac231a5726572006","name":"Cihang Xie","hidden":false},{"_id":"69140683ac231a5726572007","name":"Ronald Clark","hidden":false}],"publishedAt":"2025-11-10T18:52:47.000Z","submittedOnDailyAt":"2025-11-17T04:00:22.549Z","title":"SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards","submittedOnDailyBy":{"_id":"62f5c24eea5bd6b1abc8e151","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1660273191881-noauth.jpeg","isPro":false,"fullname":"Hunar Batra","user":"hunarbatra","type":"user"},"summary":"Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language tasks, but they continue to struggle with spatial understanding. Existing spatial MLLMs often rely on explicit 3D inputs or architecture-specific modifications, and remain constrained by large-scale datasets or sparse supervision. To address these limitations, we introduce SpatialThinker, a 3D-aware MLLM trained with RL to integrate structured spatial grounding with multi-step reasoning. The model simulates human-like spatial perception by constructing a scene graph of task-relevant objects and spatial relations, and reasoning towards an answer via dense spatial rewards. SpatialThinker consists of two key contributions: (1) a data synthesis pipeline that generates STVQA-7K, a high-quality spatial VQA dataset, and (2) online RL with a multi-objective dense spatial reward enforcing spatial grounding. SpatialThinker-7B outperforms supervised fine-tuning and the sparse RL baseline on spatial understanding and real-world VQA benchmarks, nearly doubling the base-model gain compared to sparse RL, and surpassing GPT-4o. 
These results showcase the effectiveness of combining spatial supervision with reward-aligned reasoning in enabling robust 3D spatial understanding with limited data and advancing MLLMs towards human-level visual reasoning.","upvotes":14,"discussionId":"69140684ac231a5726572008","projectPage":"https://hunarbatra.com/SpatialThinker/","githubRepo":"https://github.com/hunarbatra/SpatialThinker","githubRepoAddedBy":"auto","ai_summary":"SpatialThinker, a 3D-aware MLLM trained with RL, enhances spatial understanding by integrating structured spatial grounding and multi-step reasoning, outperforming existing models on spatial VQA and real-world benchmarks.","ai_keywords":["MLLMs","spatial understanding","3D inputs","architecture-specific modifications","large-scale datasets","sparse supervision","SpatialThinker","scene graph","spatial rewards","data synthesis pipeline","STVQA-7K","online RL","multi-objective dense spatial reward","spatial grounding","reward-aligned reasoning","visual reasoning"],"githubStars":29,"organization":{"_id":"690ddf6f05d3fca3614552a7","name":"OX-PIXL","fullname":"Perceptual Intelligence and Extended Reality Lab","avatar":"https://cdn-uploads.huggingface.co/production/uploads/6305ee63d70693fdf1c7dbb8/6pCxseWIvO9rzU5h5nFPr.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"63834de659424581c35c42ae","avatarUrl":"/avatars/28d7e686010a4a570ce3e77698df448c.svg","isPro":false,"fullname":"Miguel Farinha","user":"mlfarinha","type":"user"},{"_id":"6305ee63d70693fdf1c7dbb8","avatarUrl":"/avatars/0e81ed3757b4e65be82063b538c3fe49.svg","isPro":false,"fullname":"Ronald Clark","user":"r0nn13","type":"user"},{"_id":"62f5c24eea5bd6b1abc8e151","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1660273191881-noauth.jpeg","isPro":false,"fullname":"Hunar Batra","user":"hunarbatra","type":"user"},{"_id":"668e16bbf71499ffa7a5133c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/668e16bbf71499ffa7a5133c/CSHeMp_zWRpLm-jtNt8OG.jpeg","isPro":false,"fullname":"Jerred Chen","user":"jerredchen00","type":"user"},{"_id":"662edf5f04f9341b56fa8a81","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/AucZp4zy0w7c7EjtVYujE.jpeg","isPro":false,"fullname":"Ananay arora","user":"ananayarora","type":"user"},{"_id":"684178483b8c48d81babcbf5","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/684178483b8c48d81babcbf5/cFmEFnm_CPaqxb7yw17zv.jpeg","isPro":false,"fullname":"Yuanze Lin","user":"YuanzeLin","type":"user"},{"_id":"60aef0fbee40717d1a8fa6a5","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1624676266012-60aef0fbee40717d1a8fa6a5.png","isPro":false,"fullname":"Mayank Bhaskar","user":"cataluna84","type":"user"},{"_id":"64567d1a4a7ffb7d5a492e93","avatarUrl":"/avatars/c5b5a8f4c8e3fea3e69bfb4908d544a8.svg","isPro":false,"fullname":"Jacob Lin","user":"jacoblin","type":"user"},{"_id":"684d57f26e04c265777ead3f","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/cuOj-bQqukSZreXgUJlfm.png","isPro":false,"fullname":"Joakim 
Lee","user":"Reinforcement4All","type":"user"},{"_id":"645eb61da3c5cd8a16efffff","avatarUrl":"/avatars/9112bfeed598dfabf9e077e69e09ecc9.svg","isPro":false,"fullname":"Cihang Xie","user":"cihangxie","type":"user"},{"_id":"68a74579646ad764c5cafd71","avatarUrl":"/avatars/3cb44f4373d708e5c397dd4814a716eb.svg","isPro":false,"fullname":"MihailSlutsky","user":"MihailSlutsky","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0,"organization":{"_id":"690ddf6f05d3fca3614552a7","name":"OX-PIXL","fullname":"Perceptual Intelligence and Extended Reality Lab","avatar":"https://cdn-uploads.huggingface.co/production/uploads/6305ee63d70693fdf1c7dbb8/6pCxseWIvO9rzU5h5nFPr.png"}}">
SpatialThinker, a 3D-aware MLLM trained with RL, enhances spatial understanding by integrating structured spatial grounding and multi-step reasoning, outperforming existing models on spatial VQA and real-world benchmarks.
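The abstract describes the model constructing a scene graph of task-relevant objects and spatial relations before reasoning to an answer. As a rough, purely illustrative sketch (the actual STVQA-7K schema is defined in the paper and repository; every field name and value below is invented), such a scene-graph-grounded VQA sample could be represented as:

```python
# Hypothetical illustration of a scene-graph-grounded spatial VQA sample.
# Field names and values are invented for clarity; see the paper/repo for
# the actual STVQA-7K schema.
sample = {
    "image": "kitchen_001.jpg",
    "question": "Is the mug to the left of the laptop from the camera's viewpoint?",
    "scene_graph": {
        "objects": [
            {"id": 0, "label": "mug",    "bbox": [120, 340, 180, 410]},   # [x1, y1, x2, y2]
            {"id": 1, "label": "laptop", "bbox": [260, 300, 520, 470]},
        ],
        "relations": [
            {"subject": 0, "predicate": "left of", "object": 1},
        ],
    },
    "answer": "Yes",
}
```

Grounding answers in explicit object boxes and relation triples of this kind is what a dense spatial reward can score against during RL.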
We introduce SpatialThinker, a 3D-aware reasoning MLLM trained via RL with dense spatial rewards on STVQA-7K, a 7K-sample synthetic spatial VQA dataset we generate. SpatialThinker achieves 2x the gains of vanilla RL and surpasses GPT-4o on several tasks.
- SpatialThinker integrates scene-graph-based grounding with online RL for spatial reasoning, achieving strong performance with only 7K training samples versus the millions required by existing methods.
- We introduce STVQA-7K, a high-quality spatial VQA dataset grounded in scene graphs, along with a scalable data-generation pipeline that can produce up to 108K samples.
- We design a dense, lexicographically gated multi-objective reward that guides regionally focused spatial reasoning (see the sketch below), achieving superior in- and out-of-distribution generalization across spatial, generic VQA, and real-world benchmarks, and outperforming conventional RL and SFT baselines, open-source generalist and spatial MLLMs, and proprietary models.
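To unpack the last bullet: "lexicographically gated" means lower-priority reward terms only contribute once higher-priority ones are satisfied, so a rollout cannot collect answer credit without first producing well-formed, spatially grounded output. The Python sketch below illustrates only that gating pattern; the component names, weights, and thresholds are hypothetical and do not reproduce SpatialThinker's actual reward definition.

```python
def lexicographic_spatial_reward(fmt_ok: bool,
                                 grounding_iou: float,
                                 answer_correct: bool,
                                 iou_gate: float = 0.5) -> float:
    """Toy sketch of a lexicographically gated multi-objective reward.

    Higher-priority objectives gate lower-priority ones: a response earns
    grounding credit only if it is well-formatted, and answer credit only if
    its spatial grounding clears the IoU gate. All weights, gates, and
    component definitions here are illustrative, not the paper's.
    """
    reward = 0.0
    if not fmt_ok:                      # 1) structured-output / format check
        return reward
    reward += 0.2
    reward += 0.3 * grounding_iou       # 2) dense spatial grounding term (e.g. box IoU)
    if grounding_iou < iou_gate:        # gate: no answer credit without grounding
        return reward
    if answer_correct:                  # 3) final-answer accuracy
        reward += 0.5
    return reward

# Example: well-formatted, decent grounding, correct answer -> near-full reward
print(lexicographic_spatial_reward(True, 0.7, True))   # 0.2 + 0.21 + 0.5 = 0.91
```

A gate of this form is one way to keep a dense grounding signal from being gamed: a response that skips grounding entirely gets no credit for a lucky correct answer.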