Paper page - The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation
This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* [CineLOG: A Training Free Approach for Cinematic Long Video Generation](https://huggingface.co/papers/2512.12209) (2025)
* [Lights, Camera, Consistency: A Multistage Pipeline for Character-Stable AI Video Stories](https://huggingface.co/papers/2512.16954) (2025)
* [KlingAvatar 2.0 Technical Report](https://huggingface.co/papers/2512.13313) (2025)
* [YingVideo-MV: Music-Driven Multi-Stage Video Generation](https://huggingface.co/papers/2512.02492) (2025)
* [STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative](https://huggingface.co/papers/2512.12372) (2025)
* [ShotDirector: Directorially Controllable Multi-Shot Video Generation with Cinematographic Transitions](https://huggingface.co/papers/2512.10286) (2025)
* [DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation](https://huggingface.co/papers/2512.21252) (2025)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out [this Space](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers).

You can ask Librarian Bot for paper recommendations directly by tagging it in a comment: `@librarian-bot recommend`

---

arXivlens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/the-script-is-all-you-need-an-agentic-framework-for-long-horizon-dialogue-to-cinematic-video-generation-5945-1346a51d

- Executive Summary
- Detailed Breakdown
- Practical Applications
Paper: arXiv 2601.17737 · published 2026-01-25 · Tencent Hunyuan
Authors: Chenyu Mu, Xin He, Qu Yang, Wanshun Chen, Jiadi Yao, Huang Liu, Zihao Yi, Bo Zhao, Xingyu Chen, Ruotian Ma, Fanghua Ye, Erkun Yang, Cheng Deng, Zhaopeng Tu, Xiaolong Li, Linus
Project page: https://xd-mu.github.io/ScriptIsAllYouNeed/
AI-generated summary

A novel end-to-end agentic framework translates dialogue into cinematic videos through specialized agents that generate and orchestrate video content while maintaining narrative coherence.

Abstract
Recent advances in video generation have produced models capable of synthesizing stunning visual content from simple text prompts. However, these models struggle to generate long-form, coherent narratives from high-level concepts like dialogue, revealing a "semantic gap" between a creative idea and its cinematic execution. To bridge this gap, we introduce a novel, end-to-end agentic framework for dialogue-to-cinematic-video generation. Central to our framework is ScripterAgent, a model trained to translate coarse dialogue into a fine-grained, executable cinematic script. To enable this, we construct ScriptBench, a new large-scale benchmark with rich multimodal context, annotated via an expert-guided pipeline. The generated script then guides DirectorAgent, which orchestrates state-of-the-art video models using a cross-scene continuous generation strategy to ensure long-horizon coherence. Our comprehensive evaluation, featuring an AI-powered CriticAgent and a new Visual-Script Alignment (VSA) metric, shows our framework significantly improves script faithfulness and temporal fidelity across all tested video models. Furthermore, our analysis uncovers a crucial trade-off in current SOTA models between visual spectacle and strict script adherence, providing valuable insights for the future of automated filmmaking.
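The three-agent pipeline the abstract describes (ScripterAgent writes a shot-level script from dialogue, DirectorAgent generates video shot by shot while carrying context across scenes, CriticAgent scores visual-script alignment) can be sketched as a minimal Python stub. The agent names come from the paper, but every method, data structure, and the toy VSA proxy below are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of the paper's agentic pipeline. Class names follow
# the paper; all method signatures and internals are illustrative only.
from dataclasses import dataclass, field


@dataclass
class Shot:
    scene: int
    description: str


@dataclass
class Script:
    shots: list = field(default_factory=list)


class ScripterAgent:
    """Turns coarse dialogue into a fine-grained, executable script."""

    def write(self, dialogue):
        # One shot per dialogue line -- a stand-in for the trained model.
        return Script(shots=[Shot(scene=i, description=line)
                             for i, line in enumerate(dialogue)])


class DirectorAgent:
    """Orchestrates a video model shot by shot, conditioning each shot on
    the previous one (the 'cross-scene continuous generation' idea)."""

    def shoot(self, script):
        clips, context = [], None
        for shot in script.shots:
            clip = f"clip[{shot.scene}] cond_on={context!r}: {shot.description}"
            clips.append(clip)
            context = shot.description  # last shot conditions the next
        return clips


class CriticAgent:
    """Scores Visual-Script Alignment (VSA); here, a trivial text proxy."""

    def vsa(self, script, clips):
        matched = sum(s.description in c for s, c in zip(script.shots, clips))
        return matched / max(len(script.shots), 1)


dialogue = ["A: We need to leave tonight.", "B: The storm is coming."]
script = ScripterAgent().write(dialogue)
clips = DirectorAgent().shoot(script)
score = CriticAgent().vsa(script, clips)
print(len(clips), score)  # 2 clips, VSA proxy of 1.0
```

In the real system, `DirectorAgent.shoot` would call a video generation model and `CriticAgent` would be an AI-powered evaluator; the stub only shows how script state flows between the agents.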