Paper page - Learning Situated Awareness in the Real World
\n","updatedAt":"2026-02-19T03:01:17.708Z","author":{"_id":"6039478ab3ecf716b1a5fd4d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6039478ab3ecf716b1a5fd4d/_Thy4E7taiSYBLKxEKJbT.jpeg","fullname":"taesiri","name":"taesiri","type":"user","isPro":true,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":236,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.5015048980712891},"editors":["taesiri"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/6039478ab3ecf716b1a5fd4d/_Thy4E7taiSYBLKxEKJbT.jpeg"],"reactions":[],"isReport":false}},{"id":"6997bb80c633a08f9cfa1666","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2026-02-20T01:40:16.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [EgoSound: Benchmarking Sound Understanding in Egocentric Videos](https://huggingface.co/papers/2602.14122) (2026)\n* [EscherVerse: An Open World Benchmark and Dataset for Teleo-Spatial Intelligence with Physical-Dynamic and Intent-Driven Understanding](https://huggingface.co/papers/2601.01547) (2026)\n* [SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?](https://huggingface.co/papers/2602.03916) (2026)\n* [From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs](https://huggingface.co/papers/2512.19683) (2025)\n* [Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions](https://huggingface.co/papers/2601.03590) (2026)\n* [CityCube: Benchmarking Cross-view Spatial Reasoning on Vision-Language Models in Urban Environments](https://huggingface.co/papers/2601.14339) (2026)\n* [Assessing Situational and Spatial Awareness of VLMs with Synthetically Generated Video](https://huggingface.co/papers/2601.15780) (2026)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
\n
The following papers were recommended by the Semantic Scholar API
Please give a thumbs up to this comment if you found it helpful!
\n
If you want recommendations for any Paper on Hugging Face checkout this Space
\n
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend
\n","updatedAt":"2026-02-20T01:40:16.389Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7223107814788818},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2602.16682","authors":[{"_id":"69967cbb1268a6b79e0d02cf","user":{"_id":"65415f1d5168c4f3487a2103","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65415f1d5168c4f3487a2103/qJuzDpOGSDL4E1-L2lGWW.jpeg","isPro":false,"fullname":"Chuhan Li","user":"ChuhanLi","type":"user"},"name":"Chuhan Li","status":"claimed_verified","statusLastChangedAt":"2026-02-20T08:37:30.993Z","hidden":false},{"_id":"69967cbb1268a6b79e0d02d0","name":"Ruilin Han","hidden":false},{"_id":"69967cbb1268a6b79e0d02d1","name":"Joy Hsu","hidden":false},{"_id":"69967cbb1268a6b79e0d02d2","name":"Yongyuan Liang","hidden":false},{"_id":"69967cbb1268a6b79e0d02d3","name":"Rajiv Dhawan","hidden":false},{"_id":"69967cbb1268a6b79e0d02d4","name":"Jiajun Wu","hidden":false},{"_id":"69967cbb1268a6b79e0d02d5","name":"Ming-Hsuan Yang","hidden":false},{"_id":"69967cbb1268a6b79e0d02d6","name":"Xin Eric Wang","hidden":false}],"publishedAt":"2026-02-18T18:22:52.000Z","submittedOnDailyAt":"2026-02-19T00:30:39.390Z","title":"Learning Situated Awareness in the Real World","submittedOnDailyBy":{"_id":"6039478ab3ecf716b1a5fd4d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6039478ab3ecf716b1a5fd4d/_Thy4E7taiSYBLKxEKJbT.jpeg","isPro":true,"fullname":"taesiri","user":"taesiri","type":"user"},"summary":"A core aspect of human perception is situated awareness, the ability to relate ourselves to the surrounding physical environment and reason over possible actions in context. However, most existing benchmarks for multimodal foundation models (MFMs) emphasize environment-centric spatial relations (relations among objects in a scene), while largely overlooking observer-centric relationships that require reasoning relative to agent's viewpoint, pose, and motion. To bridge this gap, we introduce SAW-Bench (Situated Awareness in the Real World), a novel benchmark for evaluating egocentric situated awareness using real-world videos. SAW-Bench comprises 786 self-recorded videos captured with Ray-Ban Meta (Gen 2) smart glasses spanning diverse indoor and outdoor environments, and over 2,071 human-annotated question-answer pairs. It probes a model's observer-centric understanding with six different awareness tasks. Our comprehensive evaluation reveals a human-model performance gap of 37.66%, even with the best-performing MFM, Gemini 3 Flash. Beyond this gap, our in-depth analysis uncovers several notable findings; for example, while models can exploit partial geometric cues in egocentric videos, they often fail to infer a coherent camera geometry, leading to systematic spatial reasoning errors. 
Learning Situated Awareness in the Real World

Authors: Chuhan Li, Ruilin Han, Joy Hsu, Yongyuan Liang, Rajiv Dhawan, Jiajun Wu, Ming-Hsuan Yang, Xin Eric Wang

Published: February 18, 2026 | arXiv: 2602.16682 | Project page: https://sawbench.github.io/
AI-generated summary

SAW-Bench presents a new benchmark for evaluating egocentric situated awareness in multimodal foundation models through real-world video datasets with human-annotated question-answer pairs, focusing on observer-centric spatial reasoning tasks.
A core aspect of human perception is situated awareness, the ability to relate ourselves to the surrounding physical environment and reason over possible actions in context. However, most existing benchmarks for multimodal foundation models (MFMs) emphasize environment-centric spatial relations (relations among objects in a scene), while largely overlooking observer-centric relationships that require reasoning relative to the agent's viewpoint, pose, and motion. To bridge this gap, we introduce SAW-Bench (Situated Awareness in the Real World), a novel benchmark for evaluating egocentric situated awareness using real-world videos. SAW-Bench comprises 786 self-recorded videos captured with Ray-Ban Meta (Gen 2) smart glasses spanning diverse indoor and outdoor environments, and over 2,071 human-annotated question-answer pairs. It probes a model's observer-centric understanding with six different awareness tasks. Our comprehensive evaluation reveals a human-model performance gap of 37.66%, even with the best-performing MFM, Gemini 3 Flash. Beyond this gap, our in-depth analysis uncovers several notable findings; for example, while models can exploit partial geometric cues in egocentric videos, they often fail to infer a coherent camera geometry, leading to systematic spatial reasoning errors. We position SAW-Bench as a benchmark for situated spatial intelligence, moving beyond passive observation to understanding physically grounded, observer-centric dynamics.
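To make the evaluation setup concrete, here is a minimal, hypothetical Python sketch of how a benchmark like this could be scored. It assumes multiple-choice question-answer pairs and that the reported gap is human accuracy minus model accuracy; the field names, example items, and the `query_model` stub are illustrative assumptions, not SAW-Bench's released data format or tooling.

```python
# Hypothetical scoring sketch for a SAW-Bench-style evaluation.
# The QA schema, example items, and query_model() are assumptions made
# for illustration, not the benchmark's actual API or data format.

def query_model(video_path: str, question: str, options: list[str]) -> str:
    """Placeholder for a multimodal foundation model call.

    A real implementation would send sampled video frames plus the prompt
    to an MFM and parse the option it selects; here we return a trivial
    first-option baseline so the sketch runs end to end.
    """
    return options[0]


def accuracy(items: list[dict]) -> float:
    """Fraction of QA pairs the model answers correctly."""
    correct = sum(
        query_model(it["video_path"], it["question"], it["options"]) == it["answer"]
        for it in items
    )
    return correct / max(len(items), 1)


if __name__ == "__main__":
    # Two made-up observer-centric questions in the spirit of the benchmark.
    qa_items = [
        {
            "video_path": "clip_001.mp4",
            "question": "Relative to the wearer, which side is the doorway on?",
            "options": ["left", "right", "behind"],
            "answer": "left",
        },
        {
            "video_path": "clip_002.mp4",
            "question": "Did the wearer turn toward or away from the staircase?",
            "options": ["toward", "away"],
            "answer": "toward",
        },
    ]
    model_acc = accuracy(qa_items)
    human_acc = 0.95  # placeholder; substitute the benchmark's reported human accuracy
    print(f"model accuracy: {model_acc:.2%}")
    print(f"human-model gap: {(human_acc - model_acc):.2%}")
```

The first-option baseline and the in-memory example items exist only to keep the sketch self-contained; an actual evaluation would plug in a real MFM client and the benchmark's released question-answer files.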
SAW-Bench introduces a benchmark for egocentric situated awareness in real-world video, challenging models with observer-centric reasoning across six tasks and revealing a substantial human-model performance gap.