Highlights
- Learn from any source, and act anywhere.
- Extract highly transferable, task-centric latent actions from cross-embodiment videos.
- Perform both manipulation and navigation well with compute-efficient training.
\n","updatedAt":"2025-05-12T06:00:20.109Z","author":{"_id":"64ac1f169dcc5787461468a4","avatarUrl":"/avatars/c031a75989147009b7850df4eddfcb27.svg","fullname":"Qingwen Bu","name":"qwbu","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":6,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.8589341640472412},"editors":["qwbu"],"editorAvatarUrls":["/avatars/c031a75989147009b7850df4eddfcb27.svg"],"reactions":[],"isReport":false}},{"id":"682312369d0c428eae6aa7b7","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2025-05-13T09:34:46.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy](https://huggingface.co/papers/2503.19757) (2025)\n* [ViSA-Flow: Accelerating Robot Skill Learning via Large-Scale Video Semantic Action Flow](https://huggingface.co/papers/2505.01288) (2025)\n* [NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks](https://huggingface.co/papers/2504.19854) (2025)\n* [CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models](https://huggingface.co/papers/2503.22020) (2025)\n* [CLAM: Continuous Latent Action Models for Robot Learning from Unlabeled Demonstrations](https://huggingface.co/papers/2505.04999) (2025)\n* [Vision-Language-Action Models: Concepts, Progress, Applications and Challenges](https://huggingface.co/papers/2505.04769) (2025)\n* [Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets](https://huggingface.co/papers/2504.02792) (2025)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:

- Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy (https://huggingface.co/papers/2503.19757) (2025)
- ViSA-Flow: Accelerating Robot Skill Learning via Large-Scale Video Semantic Action Flow (https://huggingface.co/papers/2505.01288) (2025)
- NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks (https://huggingface.co/papers/2504.19854) (2025)
- CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models (https://huggingface.co/papers/2503.22020) (2025)
- CLAM: Continuous Latent Action Models for Robot Learning from Unlabeled Demonstrations (https://huggingface.co/papers/2505.04999) (2025)
- Vision-Language-Action Models: Concepts, Progress, Applications and Challenges (https://huggingface.co/papers/2505.04769) (2025)
- Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets (https://huggingface.co/papers/2504.02792) (2025)
\n","updatedAt":"2025-05-13T09:34:46.384Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6740142107009888},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2505.06111","authors":[{"_id":"68218b847202d193249511b6","user":{"_id":"64ac1f169dcc5787461468a4","avatarUrl":"/avatars/c031a75989147009b7850df4eddfcb27.svg","isPro":false,"fullname":"Qingwen Bu","user":"qwbu","type":"user"},"name":"Qingwen Bu","status":"admin_assigned","statusLastChangedAt":"2025-05-12T13:14:55.391Z","hidden":false},{"_id":"68218b847202d193249511b7","name":"Yanting Yang","hidden":false},{"_id":"68218b847202d193249511b8","user":{"_id":"66a3402e4c2093e582bdf511","avatarUrl":"/avatars/6f2e1f37b6a6cf9dc6df228482c0777a.svg","isPro":false,"fullname":"Jisong Cai","user":"SereneC","type":"user"},"name":"Jisong Cai","status":"admin_assigned","statusLastChangedAt":"2025-05-12T13:15:26.413Z","hidden":false},{"_id":"68218b847202d193249511b9","user":{"_id":"654a31c073416a223f3b5fca","avatarUrl":"/avatars/bab382c46787eaf7889ed241e12775ee.svg","isPro":false,"fullname":"Shenyuan Gao","user":"Little-Podi","type":"user"},"name":"Shenyuan Gao","status":"admin_assigned","statusLastChangedAt":"2025-05-12T13:15:33.171Z","hidden":false},{"_id":"68218b847202d193249511ba","user":{"_id":"646ec9b135f55eb49e405faa","avatarUrl":"/avatars/a17194be585d20e2a021e77a5a20e213.svg","isPro":false,"fullname":"Guanghui Ren","user":"sundrops","type":"user"},"name":"Guanghui Ren","status":"claimed_verified","statusLastChangedAt":"2025-05-12T06:50:15.305Z","hidden":false},{"_id":"68218b847202d193249511bb","user":{"_id":"67739bfa64e8b7438ae68eb4","avatarUrl":"/avatars/15193bfbce487b2de4ce8c86bd18885a.svg","isPro":false,"fullname":"Maoqing Yao","user":"AutobotZero","type":"user"},"name":"Maoqing Yao","status":"admin_assigned","statusLastChangedAt":"2025-05-12T13:15:39.850Z","hidden":false},{"_id":"68218b847202d193249511bc","user":{"_id":"67cb7d55560c3dcbb1adeaa3","avatarUrl":"/avatars/0b616d3655b0b54a621c2608b2f14379.svg","isPro":false,"fullname":"Ping Luo","user":"appleluo","type":"user"},"name":"Ping Luo","status":"admin_assigned","statusLastChangedAt":"2025-05-12T13:15:47.622Z","hidden":false},{"_id":"68218b847202d193249511bd","user":{"_id":"6499b0184936457997180c90","avatarUrl":"/avatars/b8be7bfabf746639e30330f5f623f560.svg","isPro":false,"fullname":"Hongyang Li","user":"compileme","type":"user"},"name":"Hongyang Li","status":"admin_assigned","statusLastChangedAt":"2025-05-12T13:16:21.369Z","hidden":false}],"publishedAt":"2025-05-09T15:11:13.000Z","submittedOnDailyAt":"2025-05-12T04:30:20.087Z","title":"UniVLA: Learning to Act Anywhere with Task-centric Latent Actions","submittedOnDailyBy":{"_id":"64ac1f169dcc5787461468a4","avatarUrl":"/avatars/c031a75989147009b7850df4eddfcb27.svg","isPro":false,"fullname":"Qingwen Bu","user":"qwbu","type":"user"},"summary":"A generalist robot should perform effectively across various environments.\nHowever, most existing approaches heavily rely on scaling action-annotated data\nto enhance their 
capabilities. Consequently, they are often limited to single\nphysical specification and struggle to learn transferable knowledge across\ndifferent embodiments and environments. To confront these limitations, we\npropose UniVLA, a new framework for learning cross-embodiment\nvision-language-action (VLA) policies. Our key innovation is to derive\ntask-centric action representations from videos with a latent action model.\nThis enables us to exploit extensive data across a wide spectrum of embodiments\nand perspectives. To mitigate the effect of task-irrelevant dynamics, we\nincorporate language instructions and establish a latent action model within\nthe DINO feature space. Learned from internet-scale videos, the generalist\npolicy can be deployed to various robots through efficient latent action\ndecoding. We obtain state-of-the-art results across multiple manipulation and\nnavigation benchmarks, as well as real-robot deployments. UniVLA achieves\nsuperior performance over OpenVLA with less than 1/20 of pretraining compute\nand 1/10 of downstream data. Continuous performance improvements are observed\nas heterogeneous data, even including human videos, are incorporated into the\ntraining pipeline. The results underscore UniVLA's potential to facilitate\nscalable and efficient robot policy learning.","upvotes":25,"discussionId":"68218b857202d19324951214","githubRepo":"https://github.com/OpenDriveLab/UniVLA","githubRepoAddedBy":"user","ai_summary":"UniVLA is a framework for learning cross-embodiment VLA policies using latent action models derived from internet-scale videos, achieving superior performance with reduced pretraining compute and downstream data.","ai_keywords":["latent action model","VLA policies","DINO feature space"],"githubStars":993},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"646ec9b135f55eb49e405faa","avatarUrl":"/avatars/a17194be585d20e2a021e77a5a20e213.svg","isPro":false,"fullname":"Guanghui Ren","user":"sundrops","type":"user"},{"_id":"64ac1f169dcc5787461468a4","avatarUrl":"/avatars/c031a75989147009b7850df4eddfcb27.svg","isPro":false,"fullname":"Qingwen Bu","user":"qwbu","type":"user"},{"_id":"630fabeea119d49bc1e56730","avatarUrl":"/avatars/f4a2bb937ed8d439fd8218f917b086ff.svg","isPro":false,"fullname":"Rodrigue Siry","user":"RodSiry","type":"user"},{"_id":"648eb1eb59c4e5c87dc116e0","avatarUrl":"/avatars/c636cea39c2c0937f01398c94ead5dad.svg","isPro":false,"fullname":"fdsqefsgergd","user":"T-representer","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"65d66b494bbd0d92b641cdbb","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65d66b494bbd0d92b641cdbb/6-7dm7B-JxcoS1QlCPdMN.jpeg","isPro":false,"fullname":"Andres Marafioti","user":"andito","type":"user"},{"_id":"6776855be57a4c8f9e6e7aaf","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6776855be57a4c8f9e6e7aaf/rPvn5Og7NX7PXk1d1mnXP.jpeg","isPro":false,"fullname":"Dr. 
Chad PhD","user":"Doctor-Chad-PhD","type":"user"},{"_id":"67e0f2aa5f269eb53af3d9ec","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/67e0f2aa5f269eb53af3d9ec/Q7GEQMhfECj3ThmMummKE.jpeg","isPro":false,"fullname":"flybutter","user":"flybutter","type":"user"},{"_id":"66c4a97cf2eb632adde44cf8","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/66c4a97cf2eb632adde44cf8/F88WjCnh2-JQVHmZ0MDvA.jpeg","isPro":false,"fullname":"rollercoasterX","user":"rollercoasterX","type":"user"},{"_id":"642d678777078db98b729188","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/642d678777078db98b729188/lYhIEChF4qQG8ltRF3ECw.png","isPro":false,"fullname":"algorithm","user":"algorithm","type":"user"},{"_id":"66aa9040c14f47b2b6c296e6","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/66aa9040c14f47b2b6c296e6/EFIZoCxgh45nY5twyAdhq.jpeg","isPro":false,"fullname":"dont care","user":"dont-care","type":"user"},{"_id":"6407e5294edf9f5c4fd32228","avatarUrl":"/avatars/8e2d55460e9fe9c426eb552baf4b2cb0.svg","isPro":false,"fullname":"Stoney Kang","user":"sikang99","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":3}">
UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
Published on May 9, 2025
#3 Paper of the day
Abstract
UniVLA is a framework for learning cross-embodiment VLA policies using latent action models derived from internet-scale videos, achieving superior performance with reduced pretraining compute and downstream data.
A generalist robot should perform effectively across various environments.
However, most existing approaches heavily rely on scaling action-annotated data
to enhance their capabilities. Consequently, they are often limited to a single
physical specification and struggle to learn transferable knowledge across
different embodiments and environments. To confront these limitations, we
propose UniVLA, a new framework for learning cross-embodiment
vision-language-action (VLA) policies. Our key innovation is to derive
task-centric action representations from videos with a latent action model.
This enables us to exploit extensive data across a wide spectrum of embodiments
and perspectives. To mitigate the effect of task-irrelevant dynamics, we
incorporate language instructions and establish a latent action model within
the DINO feature space. Learned from internet-scale videos, the generalist
policy can be deployed to various robots through efficient latent action
decoding. We obtain state-of-the-art results across multiple manipulation and
navigation benchmarks, as well as real-robot deployments. UniVLA achieves
superior performance over OpenVLA with less than 1/20 of pretraining compute
and 1/10 of downstream data. Continuous performance improvements are observed
as heterogeneous data, even including human videos, are incorporated into the
training pipeline. The results underscore UniVLA's potential to facilitate
scalable and efficient robot policy learning.
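
To make the abstract's core mechanism more concrete, here is a minimal sketch of how a latent action model over DINO features and an embodiment-specific latent action decoder could look. This is not the paper's implementation: it assumes a VQ-style discrete codebook, an inverse-dynamics encoder and forward-dynamics decoder over frozen DINO features, and it omits language conditioning for brevity. Class names such as LatentActionModel and LatentActionDecoder, and all hyperparameters, are hypothetical.

```python
# Minimal PyTorch-style sketch of a latent action model over DINO features.
# Hypothetical names and losses; the actual UniVLA architecture may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionModel(nn.Module):
    """Infers a discrete 'latent action' explaining the change between two frames,
    both represented by (frozen) DINO features."""
    def __init__(self, feat_dim=768, codebook_size=16, latent_dim=128):
        super().__init__()
        # Inverse-dynamics encoder: (features_t, features_t+k) -> continuous latent
        self.encoder = nn.Sequential(
            nn.Linear(2 * feat_dim, 512), nn.GELU(),
            nn.Linear(512, latent_dim),
        )
        # VQ codebook: each row is one discrete latent action embedding
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        # Forward-dynamics decoder: (features_t, latent action) -> predicted features_t+k
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim + latent_dim, 512), nn.GELU(),
            nn.Linear(512, feat_dim),
        )

    def quantize(self, z):
        # Nearest codebook entry, with a straight-through estimator for gradients
        dists = torch.cdist(z, self.codebook.weight)   # (B, codebook_size)
        idx = dists.argmin(dim=-1)                     # (B,) discrete latent action token
        z_q = self.codebook(idx)
        z_q = z + (z_q - z).detach()
        return z_q, idx

    def forward(self, feat_t, feat_tk):
        z = self.encoder(torch.cat([feat_t, feat_tk], dim=-1))
        z_q, idx = self.quantize(z)
        pred_tk = self.decoder(torch.cat([feat_t, z_q], dim=-1))
        recon = F.mse_loss(pred_tk, feat_tk)           # reconstruct future DINO features
        commit = F.mse_loss(z, z_q.detach())           # VQ commitment term
        return recon + 0.25 * commit, idx


class LatentActionDecoder(nn.Module):
    """Small embodiment-specific head mapping a latent action token to real robot actions."""
    def __init__(self, codebook_size=16, latent_dim=128, action_dim=7):
        super().__init__()
        self.embed = nn.Embedding(codebook_size, latent_dim)
        self.head = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.GELU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, latent_action_idx):
        return self.head(self.embed(latent_action_idx))
```

In this picture, pretraining would teach a generalist policy to predict the discrete latent action token from observations and language, while only the small embodiment-specific decoder head is fitted on downstream robot data, which is consistent with the abstract's claims of efficient latent action decoding and reduced downstream data.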