
arxiv:2504.02792

Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

Published on Apr 3, 2025 · Submitted by Chuning Zhu on Apr 9, 2025

Authors: Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, Abhishek Gupta

Project page: https://weirdlabuw.github.io/uwm/ · Code: https://github.com/WEIRDLabUW/unified-world-model
Abstract

AI-generated summary: Unified World Models leverage both video and action data within a unified transformer architecture to enable effective pretraining on large-scale multitask robot datasets and improve policy generalization and robustness.

Imitation learning has emerged as a promising approach towards building generalist robots. However, scaling imitation learning for large robot foundation models remains challenging due to its reliance on high-quality expert demonstrations. Meanwhile, large amounts of video data depicting a wide range of environments and diverse behaviors are readily available. This data provides a rich source of information about real-world dynamics and agent-environment interactions. Leveraging this data directly for imitation learning, however, has proven difficult due to the lack of action annotation required for most contemporary methods. In this work, we present Unified World Models (UWM), a framework that allows for leveraging both video and action data for policy learning. Specifically, a UWM integrates an action diffusion process and a video diffusion process within a unified transformer architecture, where independent diffusion timesteps govern each modality. We show that by simply controlling each diffusion timestep, UWM can flexibly represent a policy, a forward dynamics, an inverse dynamics, and a video generator. Through simulated and real-world experiments, we show that: (1) UWM enables effective pretraining on large-scale multitask robot datasets with both dynamics and action predictions, resulting in more generalizable and robust policies than imitation learning, (2) UWM naturally facilitates learning from action-free video data through independent control of modality-specific diffusion timesteps, further improving the performance of finetuned policies. Our results suggest that UWM offers a promising step toward harnessing large, heterogeneous datasets for scalable robot learning, and provides a simple unification between the often disparate paradigms of imitation learning and world modeling. Videos and code are available at https://weirdlabuw.github.io/uwm/.
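The key mechanism in the abstract is that a single transformer denoises actions and future frames jointly, with a separate diffusion timestep per modality: setting a modality's timestep to maximum noise effectively marginalizes it out, while setting it to zero conditions on clean data. The sketch below is an illustrative toy, not the authors' implementation; the model class, dimensions, and timestep conventions are assumptions, and it only shows how the four modes fall out of the two timesteps.

```python
# Hedged sketch: one joint denoiser over (action, future-observation) with
# independent timesteps t_a and t_o selecting the operating mode.
import torch
import torch.nn as nn

T_MAX = 1000  # assumed number of diffusion steps

class UWMTransformer(nn.Module):
    """Toy stand-in for the paper's unified transformer denoiser."""
    def __init__(self, act_dim=7, obs_dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Linear(act_dim + obs_dim + 2, hidden)  # +2 for the two timesteps
        self.backbone = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU())
        self.act_head = nn.Linear(hidden, act_dim)   # predicts action noise
        self.obs_head = nn.Linear(hidden, obs_dim)   # predicts observation noise

    def forward(self, noisy_act, noisy_obs, t_a, t_o):
        x = torch.cat([noisy_act, noisy_obs, t_a / T_MAX, t_o / T_MAX], dim=-1)
        h = self.backbone(self.embed(x))
        return self.act_head(h), self.obs_head(h)

model = UWMTransformer()
B, act_dim, obs_dim = 4, 7, 64
act, obs = torch.randn(B, act_dim), torch.randn(B, obs_dim)
noisy = lambda x: torch.randn_like(x)           # stand-in for a noised sample
t = lambda v: torch.full((B, 1), float(v))

# Policy: denoise actions while the future observation stays at full noise (t_o = T_MAX).
model(noisy(act), noisy(obs), t_a=t(500), t_o=t(T_MAX))
# Forward dynamics / video prediction: denoise observations given clean actions (t_a = 0).
model(act, noisy(obs), t_a=t(0), t_o=t(500))
# Inverse dynamics: denoise actions given the clean future observation (t_o = 0).
model(noisy(act), obs, t_a=t(500), t_o=t(0))
# Action-free video generation: the action branch is held at full noise (t_a = T_MAX).
model(noisy(act), noisy(obs), t_a=t(T_MAX), t_o=t(500))
```

In an actual sampler each mode would iterate these forward passes from pure noise down to t = 0; the single calls above are only meant to show which timestep pattern selects which capability.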

Community

Paper submitter

Unified World Model (UWM) is a multimodal diffusion transformer that uses separate diffusion timesteps for actions and videos to flexibly learn policies, forward dynamics, inverse dynamics, and video prediction models from both robot and video data.
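This also suggests why action-free video folds naturally into pretraining: when a clip has no action labels, the action branch can be held at maximum noise and its loss masked, so only the video denoising term trains on that sample. Below is a hedged training-step sketch using the same toy interface as the sketch above; the function names, the linear noise schedule, and the masking scheme are illustrative assumptions, not the paper's exact recipe.

```python
import torch

T_MAX = 1000  # assumed number of diffusion steps

def add_noise(x, eps, t, t_max=T_MAX):
    # Illustrative linear schedule (a real DDPM schedule would use cumulative alphas).
    alpha = 1.0 - t / t_max
    return alpha * x + (1.0 - alpha) * eps

def uwm_training_step(model, act, next_obs, has_actions, optimizer):
    """One pretraining step; `has_actions` is a (B, 1) float mask, 0 for action-free video."""
    B = next_obs.shape[0]
    t_a = torch.randint(0, T_MAX, (B, 1)).float()
    t_o = torch.randint(0, T_MAX, (B, 1)).float()
    # Action-free clips: keep the action branch at maximum noise (uninformative input).
    t_a = torch.where(has_actions > 0, t_a, torch.full_like(t_a, float(T_MAX)))

    eps_a, eps_o = torch.randn_like(act), torch.randn_like(next_obs)
    pred_a, pred_o = model(add_noise(act, eps_a, t_a),
                           add_noise(next_obs, eps_o, t_o),
                           t_a=t_a, t_o=t_o)

    loss_o = ((pred_o - eps_o) ** 2).mean()
    # Mask the action loss so video-only samples contribute nothing to it.
    loss_a = (((pred_a - eps_a) ** 2).mean(dim=-1, keepdim=True) * has_actions).mean()
    loss = loss_a + loss_o

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```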

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2504.02792 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2504.02792 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2504.02792 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.