
https://github.com/zhengxuJosh/Awesome-Multimodal-Spatial-Reasoning

\n","updatedAt":"2025-10-30T05:12:49.319Z","author":{"_id":"6806464ed918f6d2fee2bc8b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6806464ed918f6d2fee2bc8b/rgpG2oO0m6PT0KltCF_Wf.jpeg","fullname":"Chenfei Liao","name":"Chenfei-Liao","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":4,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7089166641235352},"editors":["Chenfei-Liao"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/6806464ed918f6d2fee2bc8b/rgpG2oO0m6PT0KltCF_Wf.jpeg"],"reactions":[{"reaction":"🚀","users":["Jungang","xz287"],"count":2},{"reaction":"🔥","users":["Jungang","xz287"],"count":2}],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2510.25760","authors":[{"_id":"6902f25c72739622ee92a8c4","name":"Xu Zheng","hidden":false},{"_id":"6902f25c72739622ee92a8c5","name":"Zihao Dongfang","hidden":false},{"_id":"6902f25c72739622ee92a8c6","name":"Lutao Jiang","hidden":false},{"_id":"6902f25c72739622ee92a8c7","name":"Boyuan Zheng","hidden":false},{"_id":"6902f25c72739622ee92a8c8","name":"Yulong Guo","hidden":false},{"_id":"6902f25c72739622ee92a8c9","name":"Zhenquan Zhang","hidden":false},{"_id":"6902f25c72739622ee92a8ca","name":"Giuliano Albanese","hidden":false},{"_id":"6902f25c72739622ee92a8cb","name":"Runyi Yang","hidden":false},{"_id":"6902f25c72739622ee92a8cc","name":"Mengjiao Ma","hidden":false},{"_id":"6902f25c72739622ee92a8cd","user":{"_id":"674aa9af9494dd9106006c27","avatarUrl":"/avatars/2f04bb009983ec0ade738aa7941cf6dc.svg","isPro":false,"fullname":"Zixin Zhang","user":"zhangzixin02","type":"user"},"name":"Zixin Zhang","status":"claimed_verified","statusLastChangedAt":"2025-10-30T14:16:28.822Z","hidden":false},{"_id":"6902f25c72739622ee92a8ce","name":"Chenfei Liao","hidden":false},{"_id":"6902f25c72739622ee92a8cf","name":"Dingcheng Zhen","hidden":false},{"_id":"6902f25c72739622ee92a8d0","name":"Yuanhuiyi Lyu","hidden":false},{"_id":"6902f25c72739622ee92a8d1","name":"Yuqian Fu","hidden":false},{"_id":"6902f25c72739622ee92a8d2","name":"Bin Ren","hidden":false},{"_id":"6902f25c72739622ee92a8d3","name":"Linfeng Zhang","hidden":false},{"_id":"6902f25c72739622ee92a8d4","name":"Danda Pani Paudel","hidden":false},{"_id":"6902f25c72739622ee92a8d5","name":"Nicu Sebe","hidden":false},{"_id":"6902f25c72739622ee92a8d6","name":"Luc Van Gool","hidden":false},{"_id":"6902f25c72739622ee92a8d7","name":"Xuming Hu","hidden":false}],"publishedAt":"2025-10-29T17:55:43.000Z","submittedOnDailyAt":"2025-10-30T03:42:49.312Z","title":"Multimodal Spatial Reasoning in the Large Model Era: A Survey and\n Benchmarks","submittedOnDailyBy":{"_id":"6806464ed918f6d2fee2bc8b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6806464ed918f6d2fee2bc8b/rgpG2oO0m6PT0KltCF_Wf.jpeg","isPro":false,"fullname":"Chenfei Liao","user":"Chenfei-Liao","type":"user"},"summary":"Humans possess spatial reasoning abilities that enable them to understand\nspaces through multimodal observations, such as vision and sound. Large\nmultimodal reasoning models extend these abilities by learning to perceive and\nreason, showing promising performance across diverse spatial tasks. However,\nsystematic reviews and publicly available benchmarks for these models remain\nlimited. 
In this survey, we provide a comprehensive review of multimodal\nspatial reasoning tasks with large models, categorizing recent progress in\nmultimodal large language models (MLLMs) and introducing open benchmarks for\nevaluation. We begin by outlining general spatial reasoning, focusing on\npost-training techniques, explainability, and architecture. Beyond classical 2D\ntasks, we examine spatial relationship reasoning, scene and layout\nunderstanding, as well as visual question answering and grounding in 3D space.\nWe also review advances in embodied AI, including vision-language navigation\nand action models. Additionally, we consider emerging modalities such as audio\nand egocentric video, which contribute to novel spatial understanding through\nnew sensors. We believe this survey establishes a solid foundation and offers\ninsights into the growing field of multimodal spatial reasoning. Updated\ninformation about this survey, codes and implementation of the open benchmarks\ncan be found at https://github.com/zhengxuJosh/Awesome-Spatial-Reasoning.","upvotes":17,"discussionId":"6902f25d72739622ee92a8d8","githubRepo":"https://github.com/zhengxuJosh/Awesome-Multimodal-Spatial-Reasoning","githubRepoAddedBy":"auto","ai_summary":"A survey of multimodal spatial reasoning tasks and models, focusing on large language models, post-training techniques, explainability, architecture, and emerging modalities like audio and egocentric video.","ai_keywords":["multimodal reasoning models","large language models","post-training techniques","explainability","architecture","spatial relationship reasoning","scene and layout understanding","visual question answering","grounding in 3D space","embodied AI","vision-language navigation","action models","audio","egocentric video"],"githubStars":279,"organization":{"_id":"660104b1569b30694e5a60f0","name":"hkust-gz","fullname":"hongkong university of science and technology"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6806464ed918f6d2fee2bc8b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6806464ed918f6d2fee2bc8b/rgpG2oO0m6PT0KltCF_Wf.jpeg","isPro":false,"fullname":"Chenfei Liao","user":"Chenfei-Liao","type":"user"},{"_id":"65db5f578c1745678f0ed708","avatarUrl":"/avatars/4e2de6f5f3a936447b7e391cb14c5346.svg","isPro":false,"fullname":"DONGFANG ZIHAO","user":"UUUserna","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"689b5b1bd34c948a78e98e8f","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/689b5b1bd34c948a78e98e8f/z45VRKz77_HxBqE8CqE6Y.jpeg","isPro":false,"fullname":"Yu Huang","user":"hardenyu","type":"user"},{"_id":"674aa9af9494dd9106006c27","avatarUrl":"/avatars/2f04bb009983ec0ade738aa7941cf6dc.svg","isPro":false,"fullname":"Zixin Zhang","user":"zhangzixin02","type":"user"},{"_id":"65961a7daa677adc97d8e33f","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65961a7daa677adc97d8e33f/TA0Vz7CDmYZEdrhnsj_1b.jpeg","isPro":false,"fullname":"Xu 
Zheng","user":"xz287","type":"user"},{"_id":"669f575da6a8a04c9960b600","avatarUrl":"/avatars/ec1c14022d48a39c7a529e1ddf2c1852.svg","isPro":false,"fullname":"Pei","user":"Hezep","type":"user"},{"_id":"665c476f052479b276a7239d","avatarUrl":"/avatars/f1ac6fa099efd6141a7382c6ecfced96.svg","isPro":false,"fullname":"yuwei zhang ","user":"zyw2002","type":"user"},{"_id":"6798687bbe7bf2b3fc7e8fed","avatarUrl":"/avatars/48cefbfea8ab42a8d7fcd3b3c30f0d36.svg","isPro":false,"fullname":"Yuxuan Wang","user":"yxwang1215","type":"user"},{"_id":"653b8c3e97a4d71d950e2f20","avatarUrl":"/avatars/b68880022e14556d0be58c69615db3be.svg","isPro":false,"fullname":"Zichen Wen","user":"zichenwen","type":"user"},{"_id":"6485bd278d14bcd5cdbb7c8d","avatarUrl":"/avatars/1427cf1a72b5db0cb263ad45885cf925.svg","isPro":false,"fullname":"Wenqi Zhang","user":"zwq2018","type":"user"},{"_id":"6454a3781a543cf97b1a4d89","avatarUrl":"/avatars/8183ac84b55ea8f9e191156a59eec5be.svg","isPro":false,"fullname":"CVC223366","user":"CVC2233","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0,"organization":{"_id":"660104b1569b30694e5a60f0","name":"hkust-gz","fullname":"hongkong university of science and technology"}}">
arxiv:2510.25760

Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks

Published on Oct 29, 2025 · Submitted by Chenfei Liao on Oct 30, 2025
Authors:
Xu Zheng, Zihao Dongfang, Lutao Jiang, Boyuan Zheng, Yulong Guo, Zhenquan Zhang, Giuliano Albanese, Runyi Yang, Mengjiao Ma, Zixin Zhang, Chenfei Liao, Dingcheng Zhen, Yuanhuiyi Lyu, Yuqian Fu, Bin Ren, Linfeng Zhang, Danda Pani Paudel, Nicu Sebe, Luc Van Gool, Xuming Hu

Abstract

Humans possess spatial reasoning abilities that enable them to understand spaces through multimodal observations, such as vision and sound. Large multimodal reasoning models extend these abilities by learning to perceive and reason, showing promising performance across diverse spatial tasks. However, systematic reviews and publicly available benchmarks for these models remain limited. In this survey, we provide a comprehensive review of multimodal spatial reasoning tasks with large models, categorizing recent progress in multimodal large language models (MLLMs) and introducing open benchmarks for evaluation. We begin by outlining general spatial reasoning, focusing on post-training techniques, explainability, and architecture. Beyond classical 2D tasks, we examine spatial relationship reasoning, scene and layout understanding, as well as visual question answering and grounding in 3D space. We also review advances in embodied AI, including vision-language navigation and action models. Additionally, we consider emerging modalities such as audio and egocentric video, which contribute to novel spatial understanding through new sensors. We believe this survey establishes a solid foundation and offers insights into the growing field of multimodal spatial reasoning. Updated information about this survey, codes and implementation of the open benchmarks can be found at https://github.com/zhengxuJosh/Awesome-Spatial-Reasoning.

AI-generated summary

A survey of multimodal spatial reasoning tasks and models, focusing on large language models, post-training techniques, explainability, architecture, and emerging modalities like audio and egocentric video.

Community

Paper submitter: Chenfei Liao

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2510.25760 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2510.25760 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2510.25760 in a Space README.md to link it from this page.
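As a minimal sketch of the linking mechanism described above (the repository name and description below are hypothetical, and the exact card layout is up to you), a model, dataset, or Space README.md only needs to mention the paper's arXiv URL somewhere in its text, for example:

    # my-spatial-reasoning-model  (hypothetical repository)

    This model is evaluated on the open benchmarks from
    "Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks"
    (https://arxiv.org/abs/2510.25760).

Once a card containing arxiv.org/abs/2510.25760 is pushed to the Hub, the repository should appear in the corresponding section of this page.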

Collections including this paper 3