Paper page - Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

📄Paper: https://arxiv.org/abs/2412.14171
🌎Project Page: https://vision-x-nyu.github.io/thinking-in-space.github.io/
💻Code: https://github.com/vision-x-nyu/thinking-in-space.git
🏁Benchmark: https://huggingface.co/datasets/nyu-visionx/VSI-Bench
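For quick inspection, the VSI-Bench data linked above can typically be pulled straight from the Hub with the `datasets` library. The sketch below is illustrative only and assumes the dataset's default configuration; the actual splits and column names should be checked against the dataset card.

```python
# Illustrative sketch: load VSI-Bench from the Hugging Face Hub.
# Assumes the default configuration; split and column names are not
# guaranteed and should be verified against the dataset card.
from datasets import load_dataset

vsi_bench = load_dataset("nyu-visionx/VSI-Bench")
print(vsi_bench)  # shows the available splits and their columns

# Peek at the first example of the first available split.
first_split = next(iter(vsi_bench.values()))
print(first_split[0])
```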

\n","updatedAt":"2024-12-19T15:49:52.088Z","author":{"_id":"627ccf058b4e56cfc2716425","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1652346592327-noauth.jpeg","fullname":"Shusheng Yang","name":"ShushengYang","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":12,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.46386098861694336},"editors":["ShushengYang"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1652346592327-noauth.jpeg"],"reactions":[],"isReport":false}},{"id":"6764c9aa0845e644fe4239b9","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":317,"isUserFollowing":false},"createdAt":"2024-12-20T01:34:34.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [Thinking Before Looking: Improving Multimodal LLM Reasoning via Mitigating Visual Hallucination](https://huggingface.co/papers/2411.12591) (2024)\n* [An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models](https://huggingface.co/papers/2411.06048) (2024)\n* [SAT: Spatial Aptitude Training for Multimodal Language Models](https://huggingface.co/papers/2412.07755) (2024)\n* [EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios](https://huggingface.co/papers/2412.04447) (2024)\n* [HourVideo: 1-Hour Video-Language Understanding](https://huggingface.co/papers/2411.04998) (2024)\n* [Perception Tokens Enhance Visual Reasoning in Multimodal Language Models](https://huggingface.co/papers/2412.03548) (2024)\n* [MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark](https://huggingface.co/papers/2410.19168) (2024)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

\n

The following papers were recommended by the Semantic Scholar API

\n\n

Please give a thumbs up to this comment if you found it helpful!

\n

If you want recommendations for any Paper on Hugging Face checkout this Space

\n

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend

\n","updatedAt":"2024-12-20T01:34:34.361Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":317,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6856517195701599},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2412.14171","authors":[{"_id":"676426104fa1553ccc44451f","user":{"_id":"6304baf041387c7f1177a5d2","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6304baf041387c7f1177a5d2/cQgCR8AsrMUaF2QVh97I9.jpeg","isPro":true,"fullname":"Jihan Yang","user":"jihanyang","type":"user"},"name":"Jihan Yang","status":"admin_assigned","statusLastChangedAt":"2024-12-20T09:53:33.378Z","hidden":false},{"_id":"676426104fa1553ccc444520","user":{"_id":"627ccf058b4e56cfc2716425","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1652346592327-noauth.jpeg","isPro":false,"fullname":"Shusheng Yang","user":"ShushengYang","type":"user"},"name":"Shusheng Yang","status":"claimed_verified","statusLastChangedAt":"2024-12-19T18:00:36.526Z","hidden":false},{"_id":"676426104fa1553ccc444521","user":{"_id":"654a6c59f8ebcec54510ee56","avatarUrl":"/avatars/31089658df2d754ab6e4f6ed2750cc1e.svg","isPro":false,"fullname":"Anjali W Gupta","user":"anjaliwgupta","type":"user"},"name":"Anjali W. Gupta","status":"claimed_verified","statusLastChangedAt":"2024-12-21T15:19:58.834Z","hidden":false},{"_id":"676426104fa1553ccc444522","user":{"_id":"657ad3ed2b365884914652c0","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/657ad3ed2b365884914652c0/J86SIaWafJEtqxWH84o0Z.jpeg","isPro":false,"fullname":"Rilyn Han","user":"rilynhan","type":"user"},"name":"Rilyn Han","status":"admin_assigned","statusLastChangedAt":"2024-12-20T09:53:14.391Z","hidden":false},{"_id":"676426104fa1553ccc444523","name":"Li Fei-Fei","hidden":false},{"_id":"676426104fa1553ccc444524","user":{"_id":"6596422646624a86ff3b3bda","avatarUrl":"/avatars/216e12b77e45ac5f1fa20932f5745411.svg","isPro":false,"fullname":"Saining Xie","user":"sainx","type":"user"},"name":"Saining Xie","status":"admin_assigned","statusLastChangedAt":"2024-12-20T09:52:57.712Z","hidden":false}],"publishedAt":"2024-12-18T18:59:54.000Z","submittedOnDailyAt":"2024-12-19T13:19:52.016Z","title":"Thinking in Space: How Multimodal Large Language Models See, Remember,\n and Recall Spaces","submittedOnDailyBy":{"_id":"627ccf058b4e56cfc2716425","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1652346592327-noauth.jpeg","isPro":false,"fullname":"Shusheng Yang","user":"ShushengYang","type":"user"},"summary":"Humans possess the visual-spatial intelligence to remember spaces from\nsequential visual observations. However, can Multimodal Large Language Models\n(MLLMs) trained on million-scale video datasets also ``think in space'' from\nvideos? We present a novel video-based visual-spatial intelligence benchmark\n(VSI-Bench) of over 5,000 question-answer pairs, and find that MLLMs exhibit\ncompetitive - though subhuman - visual-spatial intelligence. 
We probe models to\nexpress how they think in space both linguistically and visually and find that\nwhile spatial reasoning capabilities remain the primary bottleneck for MLLMs to\nreach higher benchmark performance, local world models and spatial awareness do\nemerge within these models. Notably, prevailing linguistic reasoning techniques\n(e.g., chain-of-thought, self-consistency, tree-of-thoughts) fail to improve\nperformance, whereas explicitly generating cognitive maps during\nquestion-answering enhances MLLMs' spatial distance ability.","upvotes":24,"discussionId":"676426114fa1553ccc44458c","githubRepo":"https://github.com/vision-x-nyu/thinking-in-space","githubRepoAddedBy":"auto","ai_summary":"MLLMs trained on large video datasets show competitive but subhuman visual-spatial intelligence, with spatial reasoning as a key bottleneck that can be improved by generating cognitive maps.","ai_keywords":["Multimodal Large Language Models","VSI-Bench","visual-spatial intelligence","spatial reasoning","cognitive maps"],"githubStars":673},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"627ccf058b4e56cfc2716425","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1652346592327-noauth.jpeg","isPro":false,"fullname":"Shusheng Yang","user":"ShushengYang","type":"user"},{"_id":"6304baf041387c7f1177a5d2","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6304baf041387c7f1177a5d2/cQgCR8AsrMUaF2QVh97I9.jpeg","isPro":true,"fullname":"Jihan Yang","user":"jihanyang","type":"user"},{"_id":"641b754d1911d3be6745cce9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/641b754d1911d3be6745cce9/Ydjcjd4VuNUGj5Cd4QHdB.png","isPro":false,"fullname":"atayloraerospace","user":"Taylor658","type":"user"},{"_id":"648eb1eb59c4e5c87dc116e0","avatarUrl":"/avatars/c636cea39c2c0937f01398c94ead5dad.svg","isPro":false,"fullname":"fdsqefsgergd","user":"T-representer","type":"user"},{"_id":"6270324ebecab9e2dcf245de","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6270324ebecab9e2dcf245de/cMbtWSasyNlYc9hvsEEzt.jpeg","isPro":false,"fullname":"Kye Gomez","user":"kye","type":"user"},{"_id":"63b2a92e18e5cf2cdd333492","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63b2a92e18e5cf2cdd333492/GxnngJG0u7d0jYTEFOrfe.png","isPro":false,"fullname":"Jaehyun Jun","user":"btjhjeon","type":"user"},{"_id":"6126250746cc6aab1f590b99","avatarUrl":"/avatars/364be3f726c1c7e37ebc61ddb5687f8a.svg","isPro":false,"fullname":"Weipeng DENG","user":"VincentDENG","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"668cd4bbe990292e5f6974d3","avatarUrl":"/avatars/d1747b2372e94500ecb5fb56809b482d.svg","isPro":false,"fullname":"Jinyeong Kim","user":"rubatoyeong","type":"user"},{"_id":"663ccbff3a74a20189d4aa2e","avatarUrl":"/avatars/83a54455e0157480f65c498cd9057cf2.svg","isPro":false,"fullname":"Nguyen Van Thanh","user":"NguyenVanThanhHust","type":"user"},{"_id":"6409f386f3dabf93824bdcd2","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6409f386f3dabf93824bdcd2/4OsM9ur7C65QRDYPUuD4I.jpeg","isPro":false,"fullname":"Ougrid 
Dumdang","user":"Ougrid-D","type":"user"},{"_id":"63a7422854f1d0225b075bfc","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63a7422854f1d0225b075bfc/XGYAcDPZG5ZEsNBWG6guw.jpeg","isPro":true,"fullname":"lhl","user":"leonardlin","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

Published on Dec 18, 2024 · Submitted by Shusheng Yang on Dec 19, 2024
Authors: Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, Saining Xie

Abstract

AI-generated summary: MLLMs trained on large video datasets show competitive but subhuman visual-spatial intelligence, with spatial reasoning as a key bottleneck that can be improved by generating cognitive maps.

Humans possess the visual-spatial intelligence to remember spaces from sequential visual observations. However, can Multimodal Large Language Models (MLLMs) trained on million-scale video datasets also "think in space" from videos? We present a novel video-based visual-spatial intelligence benchmark (VSI-Bench) of over 5,000 question-answer pairs, and find that MLLMs exhibit competitive, though subhuman, visual-spatial intelligence. We probe models to express how they think in space both linguistically and visually and find that while spatial reasoning capabilities remain the primary bottleneck for MLLMs to reach higher benchmark performance, local world models and spatial awareness do emerge within these models. Notably, prevailing linguistic reasoning techniques (e.g., chain-of-thought, self-consistency, tree-of-thoughts) fail to improve performance, whereas explicitly generating cognitive maps during question-answering enhances MLLMs' spatial distance ability.
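As a rough illustration of the cognitive-map idea from the abstract, the sketch below builds a two-stage prompt that first asks an MLLM to place the observed objects on a coarse top-down grid and then answer a distance question against that map. This is a minimal sketch of the general technique, not the paper's actual prompt or evaluation code; the grid size, the wording, and the `query_mllm` call are hypothetical placeholders.

```python
# Minimal sketch of cognitive-map prompting for a spatial-distance question.
# NOT the paper's prompt: grid size, wording, and query_mllm() are
# hypothetical placeholders for whatever MLLM interface is actually used.

def build_cognitive_map_prompt(question: str, grid_size: int = 10) -> str:
    """Two-stage prompt: build a top-down map first, then answer from it."""
    return (
        f"Step 1: From the video, place every relevant object on a "
        f"{grid_size}x{grid_size} top-down grid and list its (row, col) cell.\n"
        f"Step 2: Using only that map, answer the question below.\n\n"
        f"Question: {question}"
    )


def answer_with_cognitive_map(video_path: str, question: str) -> str:
    prompt = build_cognitive_map_prompt(question)
    # query_mllm is a stand-in for an actual model call (video frames + text).
    return query_mllm(video=video_path, prompt=prompt)  # hypothetical


if __name__ == "__main__":
    print(build_cognitive_map_prompt(
        "How far apart are the refrigerator and the sofa, in meters?"
    ))
```

The point of the intermediate map is to make the model commit to explicit object positions before estimating distances, which is the behavior the abstract reports as helpful for spatial-distance questions.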

Community

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* Thinking Before Looking: Improving Multimodal LLM Reasoning via Mitigating Visual Hallucination (2024) - https://huggingface.co/papers/2411.12591
* An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models (2024) - https://huggingface.co/papers/2411.06048
* SAT: Spatial Aptitude Training for Multimodal Language Models (2024) - https://huggingface.co/papers/2412.07755
* EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios (2024) - https://huggingface.co/papers/2412.04447
* HourVideo: 1-Hour Video-Language Understanding (2024) - https://huggingface.co/papers/2411.04998
* Perception Tokens Enhance Visual Reasoning in Multimodal Language Models (2024) - https://huggingface.co/papers/2412.03548
* MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark (2024) - https://huggingface.co/papers/2410.19168

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No models link this paper yet.

Cite arxiv.org/abs/2412.14171 in a model README.md to link it from this page.

Datasets citing this paper 2

Spaces citing this paper 1

Collections including this paper 13