Paper page - GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration

Project page: https://karine-h.github.io/GenMAC/
Paper: https://arxiv.org/pdf/2412.04440
Code: https://github.com/Karine-Huang/GenMAC

\n","updatedAt":"2024-12-09T02:46:15.084Z","author":{"_id":"637cba13b8e573d75be96ea6","avatarUrl":"/avatars/5eca230e63d66947b2a05c1ff964a96c.svg","fullname":"Nina","name":"NinaKarine","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":2,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.4150017499923706},"editors":["NinaKarine"],"editorAvatarUrls":["/avatars/5eca230e63d66947b2a05c1ff964a96c.svg"],"reactions":[],"isReport":false}},{"id":"67579aa52c2c9a9497d1c75b","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2024-12-10T01:34:29.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement](https://huggingface.co/papers/2411.15115) (2024)\n* [DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation](https://huggingface.co/papers/2411.16657) (2024)\n* [InTraGen: Trajectory-controlled Video Generation for Object Interactions](https://huggingface.co/papers/2411.16804) (2024)\n* [Motion Control for Enhanced Complex Action Video Generation](https://huggingface.co/papers/2411.08328) (2024)\n* [Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback](https://huggingface.co/papers/2412.02617) (2024)\n* [STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training](https://huggingface.co/papers/2412.00161) (2024)\n* [VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation](https://huggingface.co/papers/2412.02259) (2024)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

\n

The following papers were recommended by the Semantic Scholar API

\n\n

Please give a thumbs up to this comment if you found it helpful!

\n

If you want recommendations for any Paper on Hugging Face checkout this Space

\n

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend

\n","updatedAt":"2024-12-10T01:34:29.914Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6967470049858093},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2412.04440","authors":[{"_id":"6752a1062f16c25ac96879dc","user":{"_id":"637cba13b8e573d75be96ea6","avatarUrl":"/avatars/5eca230e63d66947b2a05c1ff964a96c.svg","isPro":false,"fullname":"Nina","user":"NinaKarine","type":"user"},"name":"Kaiyi Huang","status":"admin_assigned","statusLastChangedAt":"2024-12-06T17:48:59.205Z","hidden":false},{"_id":"6752a1062f16c25ac96879dd","user":{"_id":"638ee900ee7e45e0474a5712","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/638ee900ee7e45e0474a5712/KLli_eCbWwffKR7oLDmV3.jpeg","isPro":false,"fullname":"Yukun Huang","user":"KevinHuang","type":"user"},"name":"Yukun Huang","status":"claimed_verified","statusLastChangedAt":"2024-12-12T14:34:31.663Z","hidden":false},{"_id":"6752a1062f16c25ac96879de","user":{"_id":"641031b1a78453b8d96b8420","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1678782881444-noauth.jpeg","isPro":false,"fullname":"Xuefei Ning","user":"Foxfi","type":"user"},"name":"Xuefei Ning","status":"claimed_verified","statusLastChangedAt":"2024-12-30T19:37:18.692Z","hidden":false},{"_id":"6752a1062f16c25ac96879df","user":{"_id":"64c832a8c547ed5243d29630","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64c832a8c547ed5243d29630/a46V0xOVRVknM9mUjKkTf.jpeg","isPro":false,"fullname":"Zinan Lin","user":"fjxmlzn","type":"user"},"name":"Zinan Lin","status":"claimed_verified","statusLastChangedAt":"2024-12-09T11:07:13.931Z","hidden":false},{"_id":"6752a1062f16c25ac96879e0","name":"Yu Wang","hidden":false},{"_id":"6752a1062f16c25ac96879e1","user":{"_id":"65d5ec74cd05bc1eaa125040","avatarUrl":"/avatars/2de1b1539a86452c2c89570eeb02f5ab.svg","isPro":false,"fullname":"Xihui Liu","user":"XihuiLiu","type":"user"},"name":"Xihui Liu","status":"claimed_verified","statusLastChangedAt":"2025-05-26T08:44:05.510Z","hidden":false}],"publishedAt":"2024-12-05T18:56:05.000Z","submittedOnDailyAt":"2024-12-09T00:16:15.071Z","title":"GenMAC: Compositional Text-to-Video Generation with Multi-Agent\n Collaboration","submittedOnDailyBy":{"_id":"637cba13b8e573d75be96ea6","avatarUrl":"/avatars/5eca230e63d66947b2a05c1ff964a96c.svg","isPro":false,"fullname":"Nina","user":"NinaKarine","type":"user"},"summary":"Text-to-video generation models have shown significant progress in the recent\nyears. However, they still struggle with generating complex dynamic scenes\nbased on compositional text prompts, such as attribute binding for multiple\nobjects, temporal dynamics associated with different objects, and interactions\nbetween objects. Our key motivation is that complex tasks can be decomposed\ninto simpler ones, each handled by a role-specialized MLLM agent. Multiple\nagents can collaborate together to achieve collective intelligence for complex\ngoals. 
We propose GenMAC, an iterative, multi-agent framework that enables\ncompositional text-to-video generation. The collaborative workflow includes\nthree stages: Design, Generation, and Redesign, with an iterative loop between\nthe Generation and Redesign stages to progressively verify and refine the\ngenerated videos. The Redesign stage is the most challenging stage that aims to\nverify the generated videos, suggest corrections, and redesign the text\nprompts, frame-wise layouts, and guidance scales for the next iteration of\ngeneration. To avoid hallucination of a single MLLM agent, we decompose this\nstage to four sequentially-executed MLLM-based agents: verification agent,\nsuggestion agent, correction agent, and output structuring agent. Furthermore,\nto tackle diverse scenarios of compositional text-to-video generation, we\ndesign a self-routing mechanism to adaptively select the proper correction\nagent from a collection of correction agents each specialized for one scenario.\nExtensive experiments demonstrate the effectiveness of GenMAC, achieving\nstate-of-the art performance in compositional text-to-video generation.","upvotes":22,"discussionId":"6752a10a2f16c25ac9687b63","projectPage":"https://karine-h.github.io/GenMAC/","githubRepo":"https://github.com/Karine-Huang/GenMAC","githubRepoAddedBy":"auto","ai_summary":"GenMAC, a multi-agent framework, enhances text-to-video generation by decomposing complex tasks into simpler stages managed by specialized agents, ensuring high-quality and accurate output through iterative refinement.","ai_keywords":["text-to-video generation","MLLM","compositional text prompts","iterative framework","collaborative workflow","verification agent","suggestion agent","correction agent","output structuring agent","self-routing mechanism"],"githubStars":32},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"637cba13b8e573d75be96ea6","avatarUrl":"/avatars/5eca230e63d66947b2a05c1ff964a96c.svg","isPro":false,"fullname":"Nina","user":"NinaKarine","type":"user"},{"_id":"64b4eecf2fc8324fcb63b404","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64b4eecf2fc8324fcb63b404/zGYqYVB4-o-GBMybJ8CDA.png","isPro":false,"fullname":"Yunhan Yang","user":"yhyang-myron","type":"user"},{"_id":"64a2b496e2e19de17db7de65","avatarUrl":"/avatars/241448ca487833d6cc5d57bb1fdb6ee5.svg","isPro":false,"fullname":"Duan Chengqi","user":"gogoduan","type":"user"},{"_id":"6427e08288215cee63b1c44d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6427e08288215cee63b1c44d/rzaG978FF-ywzicWNl_xl.jpeg","isPro":false,"fullname":"yao teng","user":"tytyt","type":"user"},{"_id":"672a037c19f1f942483f680c","avatarUrl":"/avatars/a48464044e9eb11a2bc062be05d9aa9a.svg","isPro":false,"fullname":"qiulu","user":"qiulu66","type":"user"},{"_id":"6440fc05603214724eba4766","avatarUrl":"/avatars/1a82a3361c96ba7bfd429dbd3e6f0bad.svg","isPro":false,"fullname":"weimeng","user":"mengwei0427","type":"user"},{"_id":"668125557b50b433cda2a211","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/668125557b50b433cda2a211/j3z3wT5Rv9IyUKtbzQpnc.png","isPro":false,"fullname":"Tianwei Xiong","user":"YuuTennYi","type":"user"},{"_id":"64c832a8c547ed5243d29630","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64c832a8c547ed5243d29630/a46V0xOVRVknM9mUjKkTf.jpeg","isPro":false,"fullname":"Zinan 
Lin","user":"fjxmlzn","type":"user"},{"_id":"638ee900ee7e45e0474a5712","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/638ee900ee7e45e0474a5712/KLli_eCbWwffKR7oLDmV3.jpeg","isPro":false,"fullname":"Yukun Huang","user":"KevinHuang","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"648c9605565e3a44f3c9bb7b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/648c9605565e3a44f3c9bb7b/W5chvk17Zol6-2QSWkFVR.jpeg","isPro":true,"fullname":"Orr Zohar","user":"orrzohar","type":"user"},{"_id":"648eb1eb59c4e5c87dc116e0","avatarUrl":"/avatars/c636cea39c2c0937f01398c94ead5dad.svg","isPro":false,"fullname":"fdsqefsgergd","user":"T-representer","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
arxiv:2412.04440

GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration

Authors: Kaiyi Huang, Yukun Huang, Xuefei Ning, Zinan Lin, Yu Wang, Xihui Liu

Published on Dec 5, 2024 · Submitted by Nina on Dec 9, 2024

Abstract

AI-generated summary

GenMAC, a multi-agent framework, enhances text-to-video generation by decomposing complex tasks into simpler stages managed by specialized agents, ensuring high-quality and accurate output through iterative refinement.

Text-to-video generation models have shown significant progress in recent years. However, they still struggle to generate complex dynamic scenes from compositional text prompts, such as attribute binding for multiple objects, temporal dynamics associated with different objects, and interactions between objects. Our key motivation is that complex tasks can be decomposed into simpler ones, each handled by a role-specialized MLLM agent; multiple agents can then collaborate to achieve collective intelligence for complex goals. We propose GenMAC, an iterative multi-agent framework that enables compositional text-to-video generation. The collaborative workflow includes three stages: Design, Generation, and Redesign, with an iterative loop between the Generation and Redesign stages to progressively verify and refine the generated videos. The Redesign stage is the most challenging: it aims to verify the generated videos, suggest corrections, and redesign the text prompts, frame-wise layouts, and guidance scales for the next iteration of generation. To avoid hallucination by a single MLLM agent, we decompose this stage into four sequentially executed MLLM-based agents: a verification agent, a suggestion agent, a correction agent, and an output-structuring agent. Furthermore, to handle the diverse scenarios of compositional text-to-video generation, we design a self-routing mechanism that adaptively selects the proper correction agent from a collection of correction agents, each specialized for one scenario. Extensive experiments demonstrate the effectiveness of GenMAC, achieving state-of-the-art performance in compositional text-to-video generation.
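To make the workflow concrete, below is a minimal, illustrative Python sketch of the Design, Generation, and Redesign loop described in the abstract. Every name in it (the agent functions, the CORRECTION_AGENTS routing table, and the GenerationPlan fields) is a hypothetical placeholder standing in for MLLM calls and a text-to-video backbone; this is not the authors' released implementation, only a sketch of the control flow the paper describes.

```python
# Illustrative sketch of a GenMAC-style loop, based only on the abstract.
# All agent functions below are hypothetical stubs, NOT the authors' code.

from dataclasses import dataclass, field


@dataclass
class GenerationPlan:
    prompt: str                                   # (possibly redesigned) text prompt
    layouts: list = field(default_factory=list)   # frame-wise layouts
    guidance_scale: float = 7.5                   # guidance scale for the next iteration


def design_agent(user_prompt: str) -> GenerationPlan:
    """Design stage: turn the compositional prompt into an initial plan."""
    return GenerationPlan(prompt=user_prompt)


def generate_video(plan: GenerationPlan):
    """Generation stage: call a text-to-video backbone with the current plan."""
    return {"plan": plan, "frames": []}           # placeholder video object


# --- Redesign stage: four sequentially executed MLLM-based agents ----------
def verification_agent(video, user_prompt):
    """Check the video against the prompt; return (is_ok, detected_issue)."""
    return False, "attribute_binding"             # placeholder verdict


def suggestion_agent(issue, video, user_prompt):
    """Propose a high-level correction for the detected issue."""
    return f"fix {issue}"


def correct_attribute_binding(suggestion, plan):
    """Scenario-specialized correction agent for attribute binding."""
    plan.prompt += " (emphasize each object's own attributes)"
    return plan


def correct_temporal_dynamics(suggestion, plan):
    """Scenario-specialized correction agent for temporal dynamics."""
    plan.guidance_scale *= 1.1
    return plan


# Self-routing: pick the correction agent specialized for the detected scenario.
CORRECTION_AGENTS = {
    "attribute_binding": correct_attribute_binding,
    "temporal_dynamics": correct_temporal_dynamics,
}


def output_structuring_agent(plan: GenerationPlan) -> GenerationPlan:
    """Normalize the corrected plan into structured inputs for generation."""
    return plan


def genmac_loop(user_prompt: str, max_iters: int = 5):
    plan = design_agent(user_prompt)                  # Design
    for _ in range(max_iters):
        video = generate_video(plan)                  # Generation
        is_ok, issue = verification_agent(video, user_prompt)
        if is_ok:
            return video
        suggestion = suggestion_agent(issue, video, user_prompt)
        correction = CORRECTION_AGENTS.get(issue, correct_attribute_binding)
        plan = correction(suggestion, plan)           # Redesign (self-routed)
        plan = output_structuring_agent(plan)
    return video


if __name__ == "__main__":
    genmac_loop("a red cube rolls past a blue sphere, then they collide")
```

The self-routing step is modeled here as a simple dictionary lookup keyed on the issue reported by the verification agent, which is one plausible way to realize scenario-specialized correction agents; the paper may implement the routing differently.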

Community

Paper author · Paper submitter (Nina)

Project page: https://karine-h.github.io/GenMAC/
Paper: https://arxiv.org/pdf/2412.04440

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement (2024): https://huggingface.co/papers/2411.15115
* DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation (2024): https://huggingface.co/papers/2411.16657
* InTraGen: Trajectory-controlled Video Generation for Object Interactions (2024): https://huggingface.co/papers/2411.16804
* Motion Control for Enhanced Complex Action Video Generation (2024): https://huggingface.co/papers/2411.08328
* Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback (2024): https://huggingface.co/papers/2412.02617
* STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training (2024): https://huggingface.co/papers/2412.00161
* VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation (2024): https://huggingface.co/papers/2412.02259

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2412.04440 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2412.04440 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2412.04440 in a Space README.md to link it from this page.

Collections including this paper 5