Paper page - Causal World Modeling for Robot Control

\"Screenshot

\n","updatedAt":"2026-02-02T16:44:54.409Z","author":{"_id":"60f1abe7544c2adfd699860c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674929746905-60f1abe7544c2adfd699860c.jpeg","fullname":"AK","name":"akhaliq","type":"user","isPro":false,"isHf":true,"isHfAdmin":false,"isMod":false,"followerCount":9179,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.4790867567062378},"editors":["akhaliq"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674929746905-60f1abe7544c2adfd699860c.jpeg"],"reactions":[],"isReport":false}},{"id":"698152610a74b823e6539808","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2026-02-03T01:41:53.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs](https://huggingface.co/papers/2512.15692) (2025)\n* [Vidarc: Embodied Video Diffusion Model for Closed-loop Control](https://huggingface.co/papers/2512.17661) (2025)\n* [CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos](https://huggingface.co/papers/2601.04061) (2026)\n* [HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models](https://huggingface.co/papers/2512.09928) (2025)\n* [See Once, Then Act: Vision-Language-Action Model with Task Learning from One-Shot Video Demonstrations](https://huggingface.co/papers/2512.07582) (2025)\n* [Robotic VLA Benefits from Joint Learning with Motion Image Diffusion](https://huggingface.co/papers/2512.18007) (2025)\n* [InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation](https://huggingface.co/papers/2601.02456) (2026)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

\n

The following papers were recommended by the Semantic Scholar API

\n\n

Please give a thumbs up to this comment if you found it helpful!

\n

If you want recommendations for any Paper on Hugging Face checkout this Space

\n

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend

\n","updatedAt":"2026-02-03T01:41:53.882Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6948646903038025},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2601.21998","authors":[{"_id":"6980d46983fdbbe1963c2ce6","user":{"_id":"660a6a412e58edd19e348eb3","avatarUrl":"/avatars/08aafe53ca81910621db6b6f99d1e835.svg","isPro":false,"fullname":"Lin Li","user":"lilinhitcrt","type":"user"},"name":"Lin Li","status":"claimed_verified","statusLastChangedAt":"2026-02-03T10:08:25.627Z","hidden":false},{"_id":"6980d46983fdbbe1963c2ce7","name":"Qihang Zhang","hidden":false},{"_id":"6980d46983fdbbe1963c2ce8","name":"Yiming Luo","hidden":false},{"_id":"6980d46983fdbbe1963c2ce9","user":{"_id":"64548f6c363bb3aaf9cba136","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64548f6c363bb3aaf9cba136/HqJL9HQ5CWVOJCsHyuMOm.jpeg","isPro":false,"fullname":"Shuai Yang","user":"ShuaiYang03","type":"user"},"name":"Shuai Yang","status":"claimed_verified","statusLastChangedAt":"2026-02-03T10:08:27.809Z","hidden":false},{"_id":"6980d46983fdbbe1963c2cea","name":"Ruilin Wang","hidden":false},{"_id":"6980d46983fdbbe1963c2ceb","name":"Fei Han","hidden":false},{"_id":"6980d46983fdbbe1963c2cec","name":"Mingrui Yu","hidden":false},{"_id":"6980d46983fdbbe1963c2ced","name":"Zelin Gao","hidden":false},{"_id":"6980d46983fdbbe1963c2cee","name":"Nan Xue","hidden":false},{"_id":"6980d46983fdbbe1963c2cef","name":"Xing Zhu","hidden":false},{"_id":"6980d46983fdbbe1963c2cf0","name":"Yujun Shen","hidden":false},{"_id":"6980d46983fdbbe1963c2cf1","user":{"_id":"6555da9adcf410fd0753569c","avatarUrl":"/avatars/ac58a796bb54f334fdb475a4b75c4d27.svg","isPro":false,"fullname":"Yinghao Xu","user":"justimyhxu","type":"user"},"name":"Yinghao Xu","status":"claimed_verified","statusLastChangedAt":"2026-02-04T12:33:21.040Z","hidden":false}],"publishedAt":"2026-01-29T17:07:43.000Z","submittedOnDailyAt":"2026-02-02T14:14:54.399Z","title":"Causal World Modeling for Robot Control","submittedOnDailyBy":{"_id":"60f1abe7544c2adfd699860c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674929746905-60f1abe7544c2adfd699860c.jpeg","isPro":false,"fullname":"AK","user":"akhaliq","type":"user"},"summary":"This work highlights that video world modeling, alongside vision-language pre-training, establishes a fresh and independent foundation for robot learning. Intuitively, video world models provide the ability to imagine the near future by understanding the causality between actions and visual dynamics. Inspired by this, we introduce LingBot-VA, an autoregressive diffusion framework that learns frame prediction and policy execution simultaneously. 
Our model features three carefully crafted designs: (1) a shared latent space, integrating vision and action tokens, driven by a Mixture-of-Transformers (MoT) architecture, (2) a closed-loop rollout mechanism, allowing for ongoing acquisition of environmental feedback with ground-truth observations, (3) an asynchronous inference pipeline, parallelizing action prediction and motor execution to support efficient control. We evaluate our model on both simulation benchmarks and real-world scenarios, where it shows significant promise in long-horizon manipulation, data efficiency in post-training, and strong generalizability to novel configurations. The code and model are made publicly available to facilitate the community.","upvotes":30,"discussionId":"6980d46983fdbbe1963c2cf2","ai_summary":"Video world modeling enables robot learning through a unified framework that predicts frames and executes policies simultaneously using a shared latent space and closed-loop feedback mechanisms.","ai_keywords":["video world modeling","autoregressive diffusion framework","frame prediction","policy execution","shared latent space","Mixture-of-Transformers","closed-loop rollout mechanism","asynchronous inference pipeline","long-horizon manipulation","data efficiency","generalizability"]},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6039478ab3ecf716b1a5fd4d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6039478ab3ecf716b1a5fd4d/_Thy4E7taiSYBLKxEKJbT.jpeg","isPro":true,"fullname":"taesiri","user":"taesiri","type":"user"},{"_id":"660a6a412e58edd19e348eb3","avatarUrl":"/avatars/08aafe53ca81910621db6b6f99d1e835.svg","isPro":false,"fullname":"Lin Li","user":"lilinhitcrt","type":"user"},{"_id":"676cb7326fb487638321a646","avatarUrl":"/avatars/75a532c6be4a3faba30987b9a8d0af61.svg","isPro":false,"fullname":"Zhang","user":"soda1126","type":"user"},{"_id":"63047acc41387c7f11756284","avatarUrl":"/avatars/a184fca1ee2d98b81e1917aecc3cdca7.svg","isPro":false,"fullname":"i3h","user":"qihang","type":"user"},{"_id":"65fa86c8180d06d37c8f5c71","avatarUrl":"/avatars/21913d9a1d7585f73bd8a559520a3d8c.svg","isPro":false,"fullname":"Shuaiting Li","user":"list0830","type":"user"},{"_id":"664d9d1eafe6e8c3a986ef67","avatarUrl":"/avatars/88535076724e97f412ecd362a38904db.svg","isPro":false,"fullname":"hyzhou","user":"hyzhou404","type":"user"},{"_id":"64548f6c363bb3aaf9cba136","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64548f6c363bb3aaf9cba136/HqJL9HQ5CWVOJCsHyuMOm.jpeg","isPro":false,"fullname":"Shuai Yang","user":"ShuaiYang03","type":"user"},{"_id":"649958942ca6f96c8b8c1076","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/649958942ca6f96c8b8c1076/olfLIqNryaog1nAnQPwkN.jpeg","isPro":false,"fullname":"Yudong Jin","user":"krahets","type":"user"},{"_id":"685631e960baba83d051fe31","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/H_Yw7qdpX_Z3Bn9wkwSYG.png","isPro":false,"fullname":"LIn-Zhuo Chen","user":"Lin-Zhuo","type":"user"},{"_id":"674ec2a7b094645555cceb87","avatarUrl":"/avatars/278cc8a5f8fe9d37dcfe8188c6b3356f.svg","isPro":false,"fullname":"清川时","user":"BigPikachu","type":"user"},{"_id":"64c903957b4d0d947ce86bc6","avatarUrl":"/avatars/61d70a3ba00c83a5950f5c909a1a06f8.svg","isPro":false,"fullname":"Yuze 
He","user":"hyz317","type":"user"},{"_id":"6575702b15b1ca184b0b2700","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6575702b15b1ca184b0b2700/O9cEodqQmG-gyqMiO_edR.jpeg","isPro":false,"fullname":"Zaibin Zhang","user":"MrBean2024","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
Papers
arxiv:2601.21998

Causal World Modeling for Robot Control

Published on Jan 29 · Submitted by AK on Feb 2
Authors:
Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, Yujun Shen, Yinghao Xu

Abstract

Video world modeling enables robot learning through a unified framework that predicts frames and executes policies simultaneously using a shared latent space and closed-loop feedback mechanisms.

AI-generated summary

This work highlights that video world modeling, alongside vision-language pre-training, establishes a fresh and independent foundation for robot learning. Intuitively, video world models provide the ability to imagine the near future by understanding the causality between actions and visual dynamics. Inspired by this, we introduce LingBot-VA, an autoregressive diffusion framework that learns frame prediction and policy execution simultaneously. Our model features three carefully crafted designs: (1) a shared latent space integrating vision and action tokens, driven by a Mixture-of-Transformers (MoT) architecture; (2) a closed-loop rollout mechanism, allowing ongoing acquisition of environmental feedback from ground-truth observations; (3) an asynchronous inference pipeline, parallelizing action prediction and motor execution to support efficient control. We evaluate our model on both simulation benchmarks and real-world scenarios, where it shows significant promise in long-horizon manipulation, data efficiency in post-training, and strong generalizability to novel configurations. The code and model are made publicly available to facilitate community research.
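
To make designs (2) and (3) more concrete, below is a minimal Python sketch of how a closed-loop, asynchronous control loop can be organized. This is not the paper's released implementation; the functions predict_action_chunk, execute_on_robot, and get_observation are hypothetical placeholders standing in for the world-model policy, the robot interface, and the environment. The sketch only illustrates the producer/consumer pattern in which action prediction overlaps with motor execution while each prediction step consumes a fresh ground-truth observation.

```python
import queue
import threading
import time

# Hypothetical stand-ins for LingBot-VA components; names and behavior are
# illustrative only and do not reflect the actual released code.
def predict_action_chunk(observation):
    """Predict a short chunk of future actions from the latest observation."""
    time.sleep(0.05)  # simulated autoregressive/diffusion inference latency
    return [f"action_{observation}_{i}" for i in range(4)]

def execute_on_robot(action):
    """Execute a single low-level action on the (simulated) robot."""
    time.sleep(0.02)  # simulated motor execution time

def get_observation(step):
    """Return a fresh ground-truth observation, closing the loop."""
    return step

action_queue = queue.Queue(maxsize=8)

def inference_loop(num_steps):
    # Producer: each prediction is conditioned on a real observation, so the
    # imagined rollout is continually corrected by environment feedback.
    for step in range(num_steps):
        obs = get_observation(step)
        for action in predict_action_chunk(obs):
            action_queue.put(action)
    action_queue.put(None)  # sentinel: no more actions

def execution_loop():
    # Consumer: executes actions as they arrive, overlapping with inference.
    while True:
        action = action_queue.get()
        if action is None:
            break
        execute_on_robot(action)

producer = threading.Thread(target=inference_loop, args=(5,))
consumer = threading.Thread(target=execution_loop)
producer.start()
consumer.start()
producer.join()
consumer.join()
```

Under this queue-based scheme, inference latency is hidden behind motor execution rather than stalling the robot between prediction steps, which is the intent of the asynchronous pipeline described above.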

Community

Paper submitter

[Image: Screenshot 2026-02-02 at 11.44.42 AM]

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs (2512.15692, 2025)
* Vidarc: Embodied Video Diffusion Model for Closed-loop Control (2512.17661, 2025)
* CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos (2601.04061, 2026)
* HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models (2512.09928, 2025)
* See Once, Then Act: Vision-Language-Action Model with Task Learning from One-Shot Video Demonstrations (2512.07582, 2025)
* Robotic VLA Benefits from Joint Learning with Motion Image Diffusion (2512.18007, 2025)
* InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation (2601.02456, 2026)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 2

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2601.21998 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2601.21998 in a Space README.md to link it from this page.

Collections including this paper 2