arxiv:2504.01956

VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step

Published on Apr 2, 2025 · Submitted by Hanyang Wang on Apr 3, 2025

Abstract

VideoScene distills video diffusion models for efficient one-step 3D scene generation from sparse views, enhancing speed and accuracy compared to previous methods.

AI-generated summary

Recovering 3D scenes from sparse views is a challenging task due to its inherently ill-posed nature. Conventional methods have developed specialized solutions (e.g., geometry regularization or feed-forward deterministic models) to mitigate the issue. However, they still suffer from performance degradation when input views overlap only minimally and provide insufficient visual information. Fortunately, recent video generative models show promise in addressing this challenge, as they are capable of generating video clips with plausible 3D structure. Powered by large pretrained video diffusion models, some pioneering studies have started to explore the potential of the video generative prior and to create 3D scenes from sparse views. Despite impressive improvements, these approaches are limited by slow inference and a lack of 3D constraints, leading to inefficiencies and reconstruction artifacts that do not align with real-world geometric structure. In this paper, we propose VideoScene to distill a video diffusion model to generate 3D scenes in one step, aiming to build an efficient and effective tool that bridges the gap from video to 3D. Specifically, we design a 3D-aware leap flow distillation strategy to leap over time-consuming redundant information, and we train a dynamic denoising policy network to adaptively determine the optimal leap timestep during inference. Extensive experiments demonstrate that VideoScene achieves faster and superior 3D scene generation results compared with previous video diffusion models, highlighting its potential as an efficient tool for future video-to-3D applications. Project Page: https://hanyang-21.github.io/VideoScene
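The abstract's one-step "leap" pipeline can be made concrete with a short sketch. The following PyTorch snippet is purely illustrative: the module names (`DenoisePolicyNet`, `add_noise`, `one_step_inference`), the tensor shapes, and the linear noise schedule are all assumptions for exposition, not the authors' released implementation (see the GitHub repository for that).

```python
# Illustrative sketch only: names, shapes, and the linear noise schedule
# below are assumptions, not the paper's released code.
import torch
import torch.nn as nn


class DenoisePolicyNet(nn.Module):
    """Stand-in for the dynamic denoising policy network: maps a coarse
    video latent to a fractional leap timestep in (0, 1)."""

    def __init__(self, channels: int = 4):
        super().__init__()
        self.head = nn.Linear(channels, 1)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        feat = z.mean(dim=(2, 3, 4))            # (B, C): pool over frames, H, W
        return torch.sigmoid(self.head(feat))   # (B, 1), the chosen leap level t


def add_noise(z0: torch.Tensor, eps: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Re-noise a clean latent to level t with a simple linear interpolation
    (an assumed schedule; the actual scheduler may differ)."""
    t = t.view(-1, 1, 1, 1, 1)
    return (1.0 - t) * z0 + t * eps


@torch.no_grad()
def one_step_inference(z_coarse: torch.Tensor, student, policy: DenoisePolicyNet):
    """z_coarse is the latent of a cheap, 3D-consistent render built from the
    sparse input views; `student` is the distilled one-step denoiser. The
    policy picks how far to leap, skipping the redundant early part of the
    denoising trajectory, and the student refines the latent in one pass."""
    t_leap = policy(z_coarse)
    z_t = add_noise(z_coarse, torch.randn_like(z_coarse), t_leap)
    return student(z_t, t_leap)


# Toy usage with a dummy student that returns its input unchanged:
z = torch.randn(1, 4, 8, 32, 32)               # (B, C, frames, H, W)
refined = one_step_inference(z, student=lambda z_t, t: z_t,
                             policy=DenoisePolicyNet(channels=4))
print(refined.shape)                            # torch.Size([1, 4, 8, 32, 32])
```

The design choice this mirrors is that inference starts from an informative, 3D-consistent initialization rather than pure noise, so a single student step can stand in for the teacher's long multi-step trajectory.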

Community

Paper author Paper submitter

Project Page: https://hanyang-21.github.io/VideoScene
GitHub Code: https://github.com/hanyang-21/VideoScene

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* MVTokenFlow: High-quality 4D Content Generation using Multiview Token Flow (2025): https://huggingface.co/papers/2502.11697
* FlexWorld: Progressively Expanding 3D Scenes for Flexiable-View Synthesis (2025): https://huggingface.co/papers/2503.13265
* GenFusion: Closing the Loop between Reconstruction and Generation via Videos (2025): https://huggingface.co/papers/2503.21219
* Can Video Diffusion Model Reconstruct 4D Geometry? (2025): https://huggingface.co/papers/2503.21082
* Taming Video Diffusion Prior with Scene-Grounding Guidance for 3D Gaussian Splatting from Sparse Inputs (2025): https://huggingface.co/papers/2503.05082
* High-Fidelity Novel View Synthesis via Splatting-Guided Diffusion (2025): https://huggingface.co/papers/2502.12752
* Zero4D: Training-Free 4D Video Generation From Single Video Using Off-the-Shelf Video Diffusion Model (2025): https://huggingface.co/papers/2503.22622

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2504.01956 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2504.01956 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2504.01956 in a Space README.md to link it from this page.

Collections including this paper 9