Xiaomi-Robotics-0: An Open-Sourced Vision-Language-Action Model with Real-Time Execution
\n","updatedAt":"2026-02-17T01:41:23.449Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6887467503547668},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2602.12684","authors":[{"_id":"69928a9e50fb2c0be4783917","name":"Rui Cai","hidden":false},{"_id":"69928a9e50fb2c0be4783918","name":"Jun Guo","hidden":false},{"_id":"69928a9e50fb2c0be4783919","name":"Xinze He","hidden":false},{"_id":"69928a9e50fb2c0be478391a","name":"Piaopiao Jin","hidden":false},{"_id":"69928a9e50fb2c0be478391b","name":"Jie Li","hidden":false},{"_id":"69928a9e50fb2c0be478391c","name":"Bingxuan Lin","hidden":false},{"_id":"69928a9e50fb2c0be478391d","name":"Futeng Liu","hidden":false},{"_id":"69928a9e50fb2c0be478391e","name":"Wei Liu","hidden":false},{"_id":"69928a9e50fb2c0be478391f","name":"Fei Ma","hidden":false},{"_id":"69928a9e50fb2c0be4783920","name":"Kun Ma","hidden":false},{"_id":"69928a9e50fb2c0be4783921","name":"Feng Qiu","hidden":false},{"_id":"69928a9e50fb2c0be4783922","name":"Heng Qu","hidden":false},{"_id":"69928a9e50fb2c0be4783923","name":"Yifei Su","hidden":false},{"_id":"69928a9e50fb2c0be4783924","name":"Qiao Sun","hidden":false},{"_id":"69928a9e50fb2c0be4783925","name":"Dong Wang","hidden":false},{"_id":"69928a9e50fb2c0be4783926","name":"Donghao Wang","hidden":false},{"_id":"69928a9e50fb2c0be4783927","name":"Yunhong Wang","hidden":false},{"_id":"69928a9e50fb2c0be4783928","name":"Rujie Wu","hidden":false},{"_id":"69928a9e50fb2c0be4783929","name":"Diyun Xiang","hidden":false},{"_id":"69928a9e50fb2c0be478392a","name":"Yu Yang","hidden":false},{"_id":"69928a9e50fb2c0be478392b","name":"Hangjun Ye","hidden":false},{"_id":"69928a9e50fb2c0be478392c","name":"Yuan Zhang","hidden":false},{"_id":"69928a9e50fb2c0be478392d","name":"Quanyun Zhou","hidden":false}],"publishedAt":"2026-02-13T07:30:43.000Z","submittedOnDailyAt":"2026-02-16T00:40:36.822Z","title":"Xiaomi-Robotics-0: An Open-Sourced Vision-Language-Action Model with Real-Time Execution","submittedOnDailyBy":{"_id":"6039478ab3ecf716b1a5fd4d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6039478ab3ecf716b1a5fd4d/_Thy4E7taiSYBLKxEKJbT.jpeg","isPro":true,"fullname":"taesiri","user":"taesiri","type":"user"},"summary":"In this report, we introduce Xiaomi-Robotics-0, an advanced vision-language-action (VLA) model optimized for high performance and fast and smooth real-time execution. The key to our method lies in a carefully designed training recipe and deployment strategy. Xiaomi-Robotics-0 is first pre-trained on large-scale cross-embodiment robot trajectories and vision-language data, endowing it with broad and generalizable action-generation capabilities while avoiding catastrophic forgetting of the visual-semantic knowledge of the underlying pre-trained VLM. During post-training, we propose several techniques for training the VLA model for asynchronous execution to address the inference latency during real-robot rollouts. 
During deployment, we carefully align the timesteps of consecutive predicted action chunks to ensure continuous and seamless real-time rollouts. We evaluate Xiaomi-Robotics-0 extensively in simulation benchmarks and on two challenging real-robot tasks that require precise and dexterous bimanual manipulation. Results show that our method achieves state-of-the-art performance across all simulation benchmarks. Moreover, Xiaomi-Robotics-0 can roll out fast and smoothly on real robots using a consumer-grade GPU, achieving high success rates and throughput on both real-robot tasks. To facilitate future research, code and model checkpoints are open-sourced at https://xiaomi-robotics-0.github.io","upvotes":5,"discussionId":"69928a9e50fb2c0be478392e","projectPage":"https://xiaomi-robotics-0.github.io/","githubRepo":"https://github.com/XiaomiRobotics/Xiaomi-Robotics-0","githubRepoAddedBy":"user","ai_summary":"A vision-language-action model for robotics combines large-scale pretraining with specialized training techniques to enable real-time execution and high-performance manipulation tasks.","ai_keywords":["vision-language-action","cross-embodiment robot trajectories","pre-trained VLM","catastrophic forgetting","asynchronous execution","inference latency","real-time rollouts","bimanual manipulation","simulation benchmarks","real-robot tasks"],"githubStars":250,"organization":{"_id":"6821ba7e5a7efab94a235406","name":"xiaomi-research","fullname":"Xiaomi Research","avatar":"https://cdn-uploads.huggingface.co/production/uploads/673735e4373ad40af7f81ea1/DR4m0bz2Du1l0Z8Txg351.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"6039478ab3ecf716b1a5fd4d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6039478ab3ecf716b1a5fd4d/_Thy4E7taiSYBLKxEKJbT.jpeg","isPro":true,"fullname":"taesiri","user":"taesiri","type":"user"},{"_id":"6270324ebecab9e2dcf245de","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6270324ebecab9e2dcf245de/cMbtWSasyNlYc9hvsEEzt.jpeg","isPro":false,"fullname":"Kye Gomez","user":"kye","type":"user"},{"_id":"6312cab05beb528b5c1500e3","avatarUrl":"/avatars/a328e8cc99fb031b2d5c911c4b577e7e.svg","isPro":false,"fullname":"Fu-En Yang","user":"FuEnYang","type":"user"},{"_id":"6984e06c2f56a9fe4fdbe727","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/xLLd5qpE4H2gJ8bfE0gN6.png","isPro":false,"fullname":"Mark Lu","user":"dyanpwild","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0,"organization":{"_id":"6821ba7e5a7efab94a235406","name":"xiaomi-research","fullname":"Xiaomi Research","avatar":"https://cdn-uploads.huggingface.co/production/uploads/673735e4373ad40af7f81ea1/DR4m0bz2Du1l0Z8Txg351.png"}}">
AI-generated summary
A vision-language-action model for robotics combines large-scale pretraining with specialized training techniques to enable real-time execution and high-performance manipulation tasks.
In this report, we introduce Xiaomi-Robotics-0, an advanced vision-language-action (VLA) model optimized for high performance and fast, smooth real-time execution. The key to our method lies in a carefully designed training recipe and deployment strategy. Xiaomi-Robotics-0 is first pre-trained on large-scale cross-embodiment robot trajectories and vision-language data, endowing it with broad, generalizable action-generation capabilities while avoiding catastrophic forgetting of the visual-semantic knowledge of the underlying pre-trained VLM. During post-training, we propose several techniques for training the VLA model for asynchronous execution, addressing inference latency during real-robot rollouts. During deployment, we carefully align the timesteps of consecutive predicted action chunks to ensure continuous, seamless real-time rollouts. We evaluate Xiaomi-Robotics-0 extensively on simulation benchmarks and on two challenging real-robot tasks that require precise, dexterous bimanual manipulation. Results show that our method achieves state-of-the-art performance across all simulation benchmarks. Moreover, Xiaomi-Robotics-0 rolls out quickly and smoothly on real robots using a consumer-grade GPU, achieving high success rates and throughput on both real-robot tasks. To facilitate future research, code and model checkpoints are open-sourced at https://xiaomi-robotics-0.github.io
The open-sourced Xiaomi-Robotics-0 delivers a vision-language-action model for real-time robotic control, built on cross-embodiment data and a pre-trained VLM, with asynchronous inference for smooth real-time rollouts.
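To make the asynchronous-execution idea from the abstract concrete, the sketch below shows one common way such a scheme can be structured: an inference thread keeps predicting action chunks while a fixed-rate control thread executes the previous chunk, and each new chunk is aligned to the current timestep by dropping the actions whose steps have already elapsed. This is a minimal illustration, not the authors' implementation; all names (predict_chunk, get_observation, send_action, CONTROL_HZ, CHUNK_LEN) are hypothetical placeholders.

```python
# Minimal sketch of asynchronous action-chunk execution with timestep alignment.
# Hypothetical API: model.predict_chunk(obs) returns a list of CHUNK_LEN actions.
import threading
import time
from collections import deque

CONTROL_HZ = 50          # assumed control frequency (Hz)
CHUNK_LEN = 32           # assumed number of actions per predicted chunk

action_queue = deque()   # (timestep, action) pairs awaiting execution
lock = threading.Lock()
step = 0                 # global control timestep, advanced by the control loop


def inference_loop(model, get_observation):
    """Continuously predict new chunks while the robot executes old ones."""
    global step
    while True:
        with lock:
            start_step = step                 # timestep the chunk is predicted for
        obs = get_observation()
        chunk = model.predict_chunk(obs)      # slow: inference runs off the control path
        with lock:
            elapsed = step - start_step       # timesteps that passed during inference
            # Alignment: discard actions whose timesteps already elapsed, then
            # replace the queued tail with the freshly predicted actions.
            fresh = chunk[elapsed:]
            action_queue.clear()
            action_queue.extend(
                (start_step + elapsed + i, a) for i, a in enumerate(fresh)
            )


def control_loop(send_action):
    """Fixed-rate executor that pops one aligned action per control tick."""
    global step
    period = 1.0 / CONTROL_HZ
    while True:
        t0 = time.monotonic()
        with lock:
            if action_queue:
                _, action = action_queue.popleft()
                send_action(action)
            step += 1
        time.sleep(max(0.0, period - (time.monotonic() - t0)))
```

Because the control loop never waits on inference, the robot keeps moving at its native rate even when a prediction takes several control periods, which is the property the paper describes as continuous, seamless real-time rollouts.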