Papers
arxiv:2510.17439

From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors

Published on Oct 20, 2025 · Submitted by sijin on Oct 29, 2025
Authors:
Zhengshen Zhang, Hao Li, Yalun Dai, Zhengbang Zhu, Lei Zhou, Chenchen Liu, Dong Wang, Francis E. H. Tay, Sijin Chen, Ziwei Liu, Yuxiao Liu, Xinghang Li, Pan Zhou

Abstract

AI-generated summary

FALCON enhances vision-language-action models by integrating rich 3D spatial tokens into the action head, improving spatial reasoning and modality transferability.

Existing vision-language-action (VLA) models act in the 3D real world but are typically built on 2D encoders, leaving a spatial reasoning gap that limits generalization and adaptability. Recent 3D integration techniques for VLAs either require specialized sensors and transfer poorly across modalities, or inject weak cues that lack geometry and degrade vision-language alignment. In this work, we introduce FALCON (From Spatial to Action), a novel paradigm that injects rich 3D spatial tokens into the action head. FALCON leverages spatial foundation models to deliver strong geometric priors from RGB alone, and includes an Embodied Spatial Model that can optionally fuse depth or pose for higher fidelity when available, without retraining or architectural changes. To preserve language reasoning, spatial tokens are consumed by a Spatial-Enhanced Action Head rather than being concatenated into the vision-language backbone. These designs enable FALCON to address limitations in spatial representation, modality transferability, and alignment. In comprehensive evaluations across three simulation benchmarks and eleven real-world tasks, FALCON achieves state-of-the-art performance, consistently surpasses competitive baselines, and remains robust under clutter, spatial-prompt conditioning, and variations in object scale and height.
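The abstract's central design choice is that geometric priors reach the policy through the action head rather than through the vision-language backbone's token stream. The PyTorch sketch below illustrates that decoupling only; it is not the authors' implementation. The class name echoes the paper's "Spatial-Enhanced Action Head", but the use of cross-attention, the tensor dimensions, and the number of action queries are all assumptions made for illustration.

```python
# Minimal sketch of decoupling spatial tokens from the VL backbone (assumed design, not FALCON's code).
import torch
import torch.nn as nn


class SpatialEnhancedActionHead(nn.Module):
    """Action head that consumes 3D spatial tokens via cross-attention,
    keeping them out of the vision-language backbone's token stream."""

    def __init__(self, d_model: int = 512, n_heads: int = 8,
                 n_queries: int = 8, action_dim: int = 7):
        super().__init__()
        # Learned action queries attend first to VL features, then to spatial tokens.
        self.action_queries = nn.Parameter(torch.randn(1, n_queries, d_model))
        self.vl_cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.spatial_cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.to_action = nn.Linear(d_model, action_dim)

    def forward(self, vl_features: torch.Tensor, spatial_tokens: torch.Tensor) -> torch.Tensor:
        # vl_features:    (B, N_vl, d)  - features from the vision-language backbone
        # spatial_tokens: (B, N_sp, d)  - geometric priors from a spatial foundation model
        B = vl_features.shape[0]
        q = self.action_queries.expand(B, -1, -1)
        q = self.norm1(q + self.vl_cross_attn(q, vl_features, vl_features)[0])
        q = self.norm2(q + self.spatial_cross_attn(q, spatial_tokens, spatial_tokens)[0])
        return self.to_action(q)  # (B, n_queries, action_dim) action chunk


if __name__ == "__main__":
    head = SpatialEnhancedActionHead()
    vl = torch.randn(2, 256, 512)      # placeholder VL backbone features
    spatial = torch.randn(2, 64, 512)  # placeholder RGB-derived spatial tokens
    print(head(vl, spatial).shape)     # torch.Size([2, 8, 7])
```

Because the spatial tokens enter only through the action head's cross-attention, the vision-language token sequence is left untouched, which matches the paper's stated goal of preserving language reasoning while adding geometry.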

Community

Paper author · Paper submitter


Project Page: https://falcon-vla.github.io/


Models citing this paper 0

No model linking this paper


Datasets citing this paper 0

No dataset linking this paper


Spaces citing this paper 0

No Space linking this paper


Collections including this paper 2