arxiv:2309.15091

VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning

Published on Sep 26, 2023
· Submitted by AK on Sep 26, 2023
#2 Paper of the day
Authors: Han Lin, Abhay Zala, Jaemin Cho, Mohit Bansal

Abstract

AI-generated summary: VideoDirectorGPT combines LLMs for video content planning with a layout-controlled video generator, enabling temporally consistent multi-scene video generation from text prompts.

Although recent text-to-video (T2V) generation methods have seen significant advancements, most of these works focus on producing short video clips of a single event with a single background (i.e., single-scene videos). Meanwhile, recent large language models (LLMs) have demonstrated their capability in generating layouts and programs to control downstream visual modules such as image generation models. This raises an important question: can we leverage the knowledge embedded in these LLMs for temporally consistent long video generation? In this paper, we propose VideoDirectorGPT, a novel framework for consistent multi-scene video generation that uses the knowledge of LLMs for video content planning and grounded video generation. Specifically, given a single text prompt, we first ask our video planner LLM (GPT-4) to expand it into a 'video plan', which involves generating the scene descriptions, the entities with their respective layouts, the background for each scene, and consistency groupings of the entities and backgrounds. Next, guided by this output from the video planner, our video generator, Layout2Vid, has explicit control over spatial layouts and can maintain temporal consistency of entities/backgrounds across scenes, while only trained with image-level annotations. Our experiments demonstrate that VideoDirectorGPT framework substantially improves layout and movement control in both single- and multi-scene video generation and can generate multi-scene videos with visual consistency across scenes, while achieving competitive performance with SOTAs in open-domain single-scene T2V generation. We also demonstrate that our framework can dynamically control the strength for layout guidance and can also generate videos with user-provided images. We hope our framework can inspire future work on better integrating the planning ability of LLMs into consistent long video generation.
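To make the "video plan" described above concrete, here is a minimal Python sketch of how such a plan might be represented: per-scene descriptions, entities with bounding-box layouts, backgrounds, and consistency groupings. The class and field names are illustrative assumptions, not the paper's or the released repository's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# (x0, y0, x1, y1) in normalized image coordinates
BoundingBox = Tuple[float, float, float, float]

@dataclass
class Entity:
    name: str                    # e.g. "golden retriever"
    layout: List[BoundingBox]    # one box per (key)frame within the scene

@dataclass
class Scene:
    description: str             # what happens in this scene
    background: str              # e.g. "sunny backyard"
    entities: List[Entity] = field(default_factory=list)

@dataclass
class VideoPlan:
    scenes: List[Scene]
    # Groups of (scene_index, entity_name) that should stay visually
    # consistent across scenes, e.g. the same dog in scenes 0, 1 and 3.
    consistency_groups: List[List[Tuple[int, str]]] = field(default_factory=list)
```

In this reading, the planner LLM fills in a `VideoPlan`, and the layout-controlled generator consumes one `Scene` at a time while reusing shared entity identities from `consistency_groups`.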

Community

Awesome, definitely want to try it.

Where is the code, chachi?

Alrighty this one looks cool! @froilo here's the repo: https://github.com/HL-hanlin/VideoDirectorGPT

So generating coherent videos spanning multiple scenes from text descriptions is hard with AI right now. You can make short clips easily, but smoothly transitioning across diverse events and maintaining continuity is the hard part.

In this paper from UNC Chapel Hill, the authors propose VIDEODIRECTORGPT, a two-stage framework attempting to address multi-scene video generation.

Here are my highlights from the paper:

  • Two-stage approach: a language model generates a detailed "video plan", then a video generation module renders scenes based on that plan
  • Video plan contains multi-scene descriptions, entities with layouts, backgrounds, and consistency groupings that guide downstream video generation (a toy JSON example follows this list)
  • Video generation module, Layout2Vid, is trained only on image-level annotations and adds spatial layout control and cross-scene consistency to an existing text-to-video model
  • Experiments show improved object layout and movement control in single-scene videos vs. baselines
  • Multi-scene videos display higher object consistency across scenes compared to baselines
  • Competitive open-domain video generation performance maintained
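
For a sense of what the planner hand-off could look like in practice, below is a hedged toy example: a hand-written JSON plan in the shape sketched earlier (two scenes, one recurring entity) plus a quick consistency check before it would be passed to the generator. The JSON schema is an assumption for illustration only, not the format used by the paper or the repo.

```python
import json

# Hypothetical "video plan" an LLM director might emit, followed by a small
# sanity check before handing it to the video generator.
plan_json = """
{
  "scenes": [
    {"description": "A chef kneads dough on a wooden counter",
     "background": "rustic kitchen",
     "entities": [{"name": "chef", "layout": [[0.2, 0.1, 0.7, 0.9]]}]},
    {"description": "The chef slides the loaf into a brick oven",
     "background": "rustic kitchen",
     "entities": [{"name": "chef", "layout": [[0.1, 0.1, 0.5, 0.9]]},
                  {"name": "brick oven", "layout": [[0.55, 0.2, 0.95, 0.8]]}]}
  ],
  "consistency_groups": [[[0, "chef"], [1, "chef"]]]
}
"""

plan = json.loads(plan_json)

# Check that every consistency-group reference points at an entity that
# actually appears in the referenced scene.
for group in plan["consistency_groups"]:
    for scene_idx, entity_name in group:
        names = {e["name"] for e in plan["scenes"][scene_idx]["entities"]}
        assert entity_name in names, f"'{entity_name}' missing from scene {scene_idx}"

print(f"Validated plan with {len(plan['scenes'])} scenes.")
```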

The key innovation seems to be using a large language model to draft detailed video plans that guide the overall video generation. And the video generator, Layout2Vid, adds explicit spatial layout control and cross-scene temporal consistency while being trained only with image-level annotations.

You can read my full summary here.

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

  • Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation (2023) - https://huggingface.co/papers/2309.03549
  • StoryBench: A Multifaceted Benchmark for Continuous Story Visualization (2023) - https://huggingface.co/papers/2308.11606
  • LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models (2023) - https://huggingface.co/papers/2309.15103
  • Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models (2023) - https://huggingface.co/papers/2308.13812
  • VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation (2023) - https://huggingface.co/papers/2309.00398

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space.

A curious alien with smooth, pale blue skin, large reflective silver eyes, and a slender frame arrives in a quiet, sunlit suburban neighborhood on Earth. He finds a raincoat on a bench. Puts it on — too big, sleeves drag. Zips it all the way up. Stands tall. The wind fills it like a balloon. He floats slightly, arms wide — like a human-shaped kite.
Style: Realistic, cinematic, 4K, natural lighting, subtle humor, gentle music in the background. Focus on facial expressions and precise movements. Come up with something else and improve the prompt for your system.


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2309.15091 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2309.15091 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2309.15091 in a Space README.md to link it from this page.

Collections including this paper 5