From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors

Zhengshen Zhang, Hao Li, Yalun Dai, Zhengbang Zhu, Lei Zhou, Chenchen Liu, Dong Wang, Francis E. H. Tay, Sijin Chen, Ziwei Liu, Yuxiao Liu, Xinghang Li, Pan Zhou (ByteDance Seed)

arXiv: 2510.17439
Project page: https://falcon-vla.github.io/
AI-generated summary
FALCON enhances vision-language-action models by integrating rich 3D spatial tokens into the action head, improving spatial reasoning and modality transferability.

Abstract
Existing vision-language-action (VLA) models act in the 3D real world but are typically built on 2D encoders, leaving a spatial reasoning gap that limits generalization and adaptability. Recent 3D integration techniques for VLAs either require specialized sensors and transfer poorly across modalities, or inject weak cues that lack geometry and degrade vision-language alignment. In this work, we introduce FALCON (From Spatial to Action), a novel paradigm that injects rich 3D spatial tokens into the action head. FALCON leverages spatial foundation models to deliver strong geometric priors from RGB alone, and includes an Embodied Spatial Model that can optionally fuse depth or pose for higher fidelity when available, without retraining or architectural changes. To preserve language reasoning, spatial tokens are consumed by a Spatial-Enhanced Action Head rather than being concatenated into the vision-language backbone. These designs enable FALCON to address limitations in spatial representation, modality transferability, and alignment. In comprehensive evaluations across three simulation benchmarks and eleven real-world tasks, FALCON achieves state-of-the-art performance, consistently surpasses competitive baselines, and remains robust under clutter, spatial-prompt conditioning, and variations in object scale and height.
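The abstract describes two design choices: geometric priors extracted from RGB by spatial foundation models (optionally fused with depth or pose via the Embodied Spatial Model), and a Spatial-Enhanced Action Head that consumes those spatial tokens instead of concatenating them into the vision-language backbone. The sketch below is a minimal, hypothetical PyTorch illustration of the second idea only; the class name mirrors the paper's terminology, but the cross-attention layout, dimensions, and single action query are assumptions, not the authors' implementation.

```python
# Minimal, illustrative sketch (not the paper's implementation): spatial tokens
# from a frozen spatial foundation model are injected into the action head via
# cross-attention, so the vision-language backbone's token stream is left untouched.
import torch
import torch.nn as nn


class SpatialEnhancedActionHead(nn.Module):
    """Hypothetical action head that attends to 3D spatial tokens alongside VLM features."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, action_dim: int = 7):
        super().__init__()
        # Learnable action query plus two cross-attention stages (assumed design).
        self.action_query = nn.Parameter(torch.randn(1, 1, d_model))
        self.vl_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, action_dim)
        )

    def forward(self, vl_tokens: torch.Tensor, spatial_tokens: torch.Tensor) -> torch.Tensor:
        # vl_tokens:      (B, N_vl, d_model)  features from the vision-language backbone
        # spatial_tokens: (B, N_sp, d_model)  geometric priors from a spatial foundation model
        b = vl_tokens.size(0)
        q = self.action_query.expand(b, -1, -1)
        q, _ = self.vl_attn(q, vl_tokens, vl_tokens)                  # ground in VL context
        q, _ = self.spatial_attn(q, spatial_tokens, spatial_tokens)   # inject 3D geometry
        return self.mlp(q).squeeze(1)                                 # (B, action_dim)


if __name__ == "__main__":
    head = SpatialEnhancedActionHead()
    vl = torch.randn(2, 196, 512)   # e.g. patch tokens from the VLM
    sp = torch.randn(2, 196, 512)   # e.g. tokens from an RGB-only spatial encoder
    print(head(vl, sp).shape)       # torch.Size([2, 7])
```

Keeping the spatial tokens confined to the action head, as in this sketch, is what lets the backbone's vision-language alignment stay intact while the policy still receives explicit geometric cues.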