arxiv:2506.01844

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Published on Jun 2, 2025 · Submitted by Andres Marafioti on Jun 3, 2025
#2 Paper of the day

Abstract

AI-generated summary: SmolVLA is a compact, efficient vision-language-action model that achieves competitive performance at reduced computational costs and can be deployed on consumer-grade hardware.

Vision-language models (VLMs) pretrained on large-scale multimodal datasets encode rich visual and linguistic knowledge, making them a strong foundation for robotics. Rather than training robotic policies from scratch, recent approaches adapt VLMs into vision-language-action (VLA) models that enable natural language-driven perception and control. However, existing VLAs are typically massive--often with billions of parameters--leading to high training costs and limited real-world deployability. Moreover, they rely on academic and industrial datasets, overlooking the growing availability of community-collected data from affordable robotic platforms. In this work, we present SmolVLA, a small, efficient, and community-driven VLA that drastically reduces both training and inference costs, while retaining competitive performance. SmolVLA is designed to be trained on a single GPU and deployed on consumer-grade GPUs or even CPUs. To further improve responsiveness, we introduce an asynchronous inference stack decoupling perception and action prediction from action execution, allowing higher control rates with chunked action generation. Despite its compact size, SmolVLA achieves performance comparable to VLAs that are 10x larger. We evaluate SmolVLA on a range of both simulated as well as real-world robotic benchmarks and release all code, pretrained models, and training data.
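To make the asynchronous-inference idea concrete, here is a minimal, self-contained sketch of the concept (this is not the LeRobot implementation; all names, thresholds, and latencies are invented for illustration): one thread executes actions from a queue at the control rate while another thread requests a fresh chunk whenever the queue runs low, so prediction overlaps with execution.

```python
import queue
import threading
import time

CHUNK = 50      # actions per predicted chunk (made-up number)
TRIGGER = 10    # request a new chunk when this few actions remain queued
CTRL_DT = 0.02  # 50 Hz control period (also illustrative)

action_queue: queue.Queue = queue.Queue()

def predict_chunk(obs: float) -> list[float]:
    """Stand-in for the policy: a slow call that returns a chunk of actions."""
    time.sleep(0.3)  # simulated inference latency
    return [obs + i * 0.01 for i in range(CHUNK)]

def inference_loop(stop: threading.Event) -> None:
    obs = 0.0
    while not stop.is_set():
        if action_queue.qsize() <= TRIGGER:
            # Plan the next chunk while the previous one is still executing.
            for action in predict_chunk(obs):
                action_queue.put(action)
            obs += 1.0  # stand-in for grabbing a fresh observation
        else:
            time.sleep(CTRL_DT)

def control_loop(stop: threading.Event) -> None:
    while not stop.is_set():
        try:
            action = action_queue.get(timeout=0.5)
        except queue.Empty:
            continue  # nothing ready yet; try again
        time.sleep(CTRL_DT)  # stand-in for sending `action` to the robot

if __name__ == "__main__":
    stop = threading.Event()
    threads = [threading.Thread(target=fn, args=(stop,), daemon=True)
               for fn in (inference_loop, control_loop)]
    for t in threads:
        t.start()
    time.sleep(3.0)  # run the toy loops for a few seconds
    stop.set()
    for t in threads:
        t.join()
```

In a real deployment the two loops could also live in separate processes or on separate machines (robot client and policy server), with the in-memory queue replaced by a network channel.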

Community

Paper author Paper submitter

SmolVLA is a small, efficient, and community-driven VLA that drastically reduces both training and inference costs, while retaining competitive performance.
Authors will be around so let's talk!

·

Wowow this is super cool! (sorry for low info comment)

Great read. Section 3 is a goldmine of its own.

·
Paper author

🥰 thank you so much! 🤗

The paper states that the model is trained on 4 GPUs, corresponding to 30k GPU-hours, but that is equivalent to 30k/24/4 = 312 days. Is the number correct?

·

I asked the author the same question.
It's the project's total, which accounts for 100+ models trained during architecture tweaking, hyperparameter tuning, ablations, and of course testing.
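For reference, the arithmetic works out as follows (the even split across runs is my assumption, not a figure from the paper):

$$
\frac{30{,}000\ \text{GPU-hours}}{4\ \text{GPUs} \times 24\ \tfrac{\text{h}}{\text{day}}} \approx 312\ \text{days on one 4-GPU node}, \qquad \frac{312\ \text{days}}{\sim 100\ \text{runs}} \approx 3\ \text{days per run}.
$$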

Especially love the async inference contributions here. After trying to run Gr00t on a cloud GPU a few weeks back and experiencing the network latencies significantly impacting performance, I really appreciate the idea of parallelising inference with action execution.

I hope we see other VLAs adopting this architecture, it feels like a key step toward robots sharing cloud GPUs rather than depending on local hardware (reducing marginal cost & increasing maintainability!).

·
Paper author

Hey @willnorris thank you so much for your words---we're glad you liked the report, and async inference 😉
We're hard at work to make sure the stack lands on main soon. It's already compatible with all the policy types LeRobot supports, and open-sourcing everything is our effort to make this the standard paradigm for the community. Why lagging? 🤓

If you're interested in following progress, check the PR here 🔗 https://github.com/huggingface/lerobot/pull/1196

Paper author

I was thinking about releasing a blogpost to detail the async architecture, and empower the community with more background about it. Anything in particular you feel we didn't cover well enough in the report @willnorris?

Think it would be really cool to see both the sync & async timing illustrations side by side, can be hard to grasp without that!

![CleanShot 2025-06-05 at 13.43.36@2x.png](https://cdn-uploads.huggingface.co/production/uploads/67812d1e1c244e2a4b4a3aa3/gaXbf82_p7dkPiAJXZjPO.png)

Paper author

Hey @willnorris 👋 Sure thing, it makes a lot of sense to illustrate the architecture we have designed against a traditional sync strategy. While we work on it, you might check out this comment in the Async Inference PR (https://github.com/huggingface/lerobot/pull/1196#issuecomment-2936144394), which compares sync vs async graphically.

Let me know if you have other questions 🤗

This is an automated message from the Librarian Bot (https://huggingface.co/librarian-bots). I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

  • Think Twice, Act Once: Token-Aware Compression and Action Reuse for Efficient Inference in Vision-Language-Action Models (2025): https://huggingface.co/papers/2505.21200
  • NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks (2025): https://huggingface.co/papers/2504.19854
  • Interactive Post-Training for Vision-Language-Action Models (2025): https://huggingface.co/papers/2505.17016
  • ReFineVLA: Reasoning-Aware Teacher-Guided Transfer Fine-Tuning (2025): https://huggingface.co/papers/2505.19080
  • ChatVLA-2: Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge (2025): https://huggingface.co/papers/2505.21906
  • VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning (2025): https://huggingface.co/papers/2505.18719
  • From Grounding to Manipulation: Case Studies of Foundation Model Integration in Embodied Robotic Systems (2025): https://huggingface.co/papers/2505.15685

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`

Is only SO-100 execution data included in the dataset for pretraining?

·

yes, correct!

Hey guys,
Amazing work!

I think there is a typo: in the "Improving Task Annotations" section, the link refers to Qwen/Qwen2.5-3B-Instruct (it should be Qwen/Qwen2.5-VL-3B-Instruct).

·

Hello! Thank you!

Great catch! Yes, you are right, the correct link ref is Qwen/Qwen2.5-VL-3B-Instruct.
Sorry about this!

·

A couple of typos in the paper itself:

  1. "At ∼10M episodes frames, our pretraining set stands at least one order of magnitude smaller than other stateof-the-art."
  2. "The datasets record trajectories relative to Unless (?) specified otherwise" (something is missing)
  3. "In Table 3, we evaluate SmolVLA on four real-world tasks. For the SO101 SO100 benchmark, the model is trained on a combination of three datasets, and success rates are reported per task as well as on average."
  4. "Table 11 indicates including state information in the VLM leads to significantly better performance for both the CA and SA variants." - from Table 11, prefix is better for CA (73.3->80.3), but worse for SA (74.8->53.3).

Hi SmolVLA team,

Awesome work! Really cool how a small dataset of diverse community data can make such a difference.

I was especially interested in your data curation process. From the paper, I saw that you used a VLM for annotations and mapped views by hand.

  • How scalable did you find this hybrid approach?
  • Were there any recurring pain points or bottlenecks during curation?

Also, generally speaking, would you say curation is a major bottleneck/time sink when developing these models? I've been looking at the ARES project and was thinking of maybe forking it, writing a better front-end/back-end stack, and deploying it as a Space, so we can improve all HF datasets on the Hub.

Thanks again for your awesome work.

I had a question regarding the asynchronous inference process. I’m relatively new to this area, so apologies in advance if this is a naive doubt.
From what I understand, the method allows the next inference cycle to begin while the action chunk from the previous inference is still being executed. Wouldn’t this introduce a mismatch in some cases, where the system’s state has evolved significantly during the execution of the previous chunk, making the observation used for the next inference outdated or stale? In such situations, wouldn’t the resulting actions be suboptimal or even incorrect?
Please correct me if I’ve misunderstood something.
Thanks!

·

Hey @aadarshram 👋 Thank you very much for your question! Indeed, your observation is spot on---if the environment evolves significantly while the next action is being predicted, then the planned actions might be arbitrarily suboptimal (or even incorrect). However, models outputting "action chunks" (which are executed open-loop) natively deal with this problem, though, to your point, our asynchronous inference stack might be more prone to it.

I think it's worth noting we did not find such instances of "high confusion" failure modes in practice, and that aggregating different chunks (rather than overriding, i.e. f(A_1, A_2) = A_2) provides a good mechanism to overcome this problem.
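As a toy illustration of "aggregate rather than override", here is one possible blending rule for the overlapping portion of two chunks (the fixed linear weight is my own choice for the example, not necessarily the rule used in LeRobot's async stack):

```python
import numpy as np

def aggregate_chunks(old_chunk: np.ndarray, new_chunk: np.ndarray,
                     overlap: int, alpha: float = 0.5) -> np.ndarray:
    """Blend the last `overlap` actions of `old_chunk` with the first
    `overlap` actions of `new_chunk`, then append the rest of `new_chunk`."""
    blended = (1 - alpha) * old_chunk[-overlap:] + alpha * new_chunk[:overlap]
    return np.concatenate([blended, new_chunk[overlap:]], axis=0)

# Example: 1-D actions, chunks of length 5 that overlap on 3 timesteps.
old = np.array([0.0, 0.1, 0.2, 0.3, 0.4])
new = np.array([0.25, 0.35, 0.45, 0.55, 0.65])
print(aggregate_chunks(old, new, overlap=3))
# -> [0.225 0.325 0.425 0.55  0.65 ]
```

Blending the overlapping steps keeps the executed trajectory continuous across the hand-off, instead of letting the newer chunk cause a jump.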

Hi all,
Could anyone help me integrate the xarm environment with SmolVLA to simulate a babbling task, please?

Can you guys list some recommended hardware for SmolVLA? The paper mentions it's lightweight enough to run on consumer-grade GPUs or even CPUs, but doesn't mention any specific details.

·

I'm not sure if it will help you, but I just got SmolVLA running on a very average CPU (7735HS) and it works... kinda. You can really tell the model stops, computes a chunk, and then does the action. It's rather slow.

SmolVLA: can it be used on both arms?

arXiv lens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/smolvla-a-vision-language-action-model-for-affordable-and-efficient-robotics-5311-34af13eb

  • Executive Summary
  • Detailed Breakdown
  • Practical Applications


Models citing this paper: 1,000+

Datasets citing this paper: 9

Spaces citing this paper: 5

Collections including this paper: 27