Paper page - Towards Physically Plausible Video Generation via VLM Planning

arxiv:2503.23368

Towards Physically Plausible Video Generation via VLM Planning

Published on Mar 30, 2025 · Submitted by Libaolu on Apr 3, 2025
Authors: Xindi Yang, Baolu Li, Yiming Zhang, Zhenfei Yin, Lei Bai, Liqian Ma, Zhiyong Wang, Jianfei Cai, Tien-Tsin Wong, Huchuan Lu, Xu Jia

Abstract

AI-generated summary: A two-stage framework combines a Vision Language Model for coarse motion planning with a Video Diffusion Model to generate physically plausible videos.

Video diffusion models (VDMs) have advanced significantly in recent years, enabling the generation of highly realistic videos and drawing the community's attention to their potential as world simulators. However, despite their capabilities, VDMs often fail to produce physically plausible videos due to an inherent lack of understanding of physics, resulting in incorrect dynamics and event sequences. To address this limitation, we propose a novel two-stage image-to-video generation framework that explicitly incorporates physics. In the first stage, we employ a Vision Language Model (VLM) as a coarse-grained motion planner, integrating chain-of-thought and physics-aware reasoning to predict rough motion trajectories/changes that approximate real-world physical dynamics while ensuring inter-frame consistency. In the second stage, we use the predicted motion trajectories/changes to guide the video generation of a VDM. Because the predicted motion trajectories/changes are rough, noise is added during inference to give the VDM freedom to generate motion with finer details. Extensive experimental results demonstrate that our framework can produce physically plausible motion, and comparative evaluations highlight the notable superiority of our approach over existing methods. More video results are available on our Project Page: https://madaoer.github.io/projects/physically_plausible_video_generation.
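
The two-stage pipeline described in the abstract can be sketched roughly as follows. This is a minimal, hypothetical illustration of the data flow only; the function names (`vlm_plan_motion`, `perturb_trajectory`, `vdm_generate`) and their internals are placeholders, not the authors' actual code or API: a VLM is prompted to plan a coarse trajectory, noise is injected at inference time, and the perturbed trajectory conditions the video diffusion model.

```python
# Hypothetical sketch of the two-stage VLM-planning + VDM-generation pipeline.
# All names, signatures, and internals are illustrative placeholders,
# not the paper's actual implementation.
import numpy as np

def vlm_plan_motion(image, prompt, num_frames):
    """Stage 1 (placeholder): a VLM prompted with chain-of-thought,
    physics-aware reasoning would return a coarse per-frame trajectory
    (e.g. normalized object box centers). Faked here as a straight line."""
    start, end = np.array([0.2, 0.8]), np.array([0.8, 0.2])
    return np.linspace(start, end, num_frames)

def perturb_trajectory(trajectory, noise_scale=0.05):
    """Add Gaussian noise to the coarse plan so the VDM keeps freedom
    to synthesize fine-grained motion details."""
    return trajectory + np.random.normal(scale=noise_scale, size=trajectory.shape)

def vdm_generate(image, trajectory, prompt):
    """Stage 2 (placeholder): a trajectory-conditioned video diffusion model
    would synthesize frames following the noisy plan; here we just return
    the conditioning inputs to show what is passed along."""
    return {"image": image, "trajectory": trajectory, "prompt": prompt}

def generate_physically_plausible_video(image, prompt, num_frames=16):
    coarse_traj = vlm_plan_motion(image, prompt, num_frames)  # Stage 1: VLM planning
    noisy_traj = perturb_trajectory(coarse_traj)              # noise injected at inference
    return vdm_generate(image, noisy_traj, prompt)            # Stage 2: guided generation
```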

Community

Paper author · Paper submitter (edited Apr 7, 2025)

We propose a novel two-stage approach that incorporates physics as a condition into video diffusion models, enabling the generation of physically plausible motion.
Our framework outperforms existing methods and achieves satisfactory results on two major physics benchmarks. By incorporating physical priors, our framework unleashes the potential of video diffusion models to serve as world simulators.
Project Page: https://madaoer.github.io/projects/physically_plausible_video_generation/
Paper Page: https://arxiv.org/abs/2503.23368

Wonderful work!

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* Exploring the Evolution of Physics Cognition in Video Generation: A Survey (https://huggingface.co/papers/2503.21765)
* PoseTraj: Pose-Aware Trajectory Control in Video Diffusion (https://huggingface.co/papers/2503.16068)
* C-Drag: Chain-of-Thought Driven Motion Controller for Video Generation (https://huggingface.co/papers/2502.19868)
* Towards Physical Understanding in Video Generation: A 3D Point Regularization Approach (https://huggingface.co/papers/2502.03639)
* A Physical Coherence Benchmark for Evaluating Video Generation Models via Optical Flow-guided Frame Prediction (https://huggingface.co/papers/2502.05503)
* MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance (https://huggingface.co/papers/2503.16421)
* WISA: World Simulator Assistant for Physics-Aware Text-to-Video Generation (https://huggingface.co/papers/2503.08153)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2503.23368 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2503.23368 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2503.23368 in a Space README.md to link it from this page.

Collections including this paper 5