    Papers
    arxiv:2512.03041

    MultiShotMaster: A Controllable Multi-Shot Video Generation Framework

    Published on Dec 2, 2025
    · Submitted by
    QINGHE WANG
    on Dec 3, 2025
    Authors: Qinghe Wang, Xiaoyu Shi, Baolu Li, Weikang Bian, Quande Liu, Huchuan Lu, Xintao Wang, Pengfei Wan, Kun Gai, Xu Jia

    Abstract

    MultiShotMaster extends a single-shot model with novel RoPE variants for flexible and controllable multi-shot video generation, addressing data scarcity with an automated annotation pipeline.

    AI-generated summary

    Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, coherent narrative, and controllability beyond text prompts. To tackle these challenges, we propose MultiShotMaster, a framework for highly controllable multi-shot video generation. We extend a pretrained single-shot model by integrating two novel variants of RoPE. First, we introduce Multi-Shot Narrative RoPE, which applies explicit phase shift at shot transitions, enabling flexible shot arrangement while preserving the temporal narrative order. Second, we design Spatiotemporal Position-Aware RoPE to incorporate reference tokens and grounding signals, enabling spatiotemporal-grounded reference injection. In addition, to overcome data scarcity, we establish an automated data annotation pipeline to extract multi-shot videos, captions, cross-shot grounding signals and reference images. Our framework leverages the intrinsic architectural properties to support multi-shot video generation, featuring text-driven inter-shot consistency, customized subject with motion control, and background-driven customized scene. Both shot count and duration are flexibly configurable. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework.
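    The core mechanism described above, shifting RoPE temporal positions at shot boundaries, can be pictured with a small sketch. The snippet below is an illustrative approximation, not the authors' implementation: the `multishot_positions` helper, the `delta` offset value, and the plain-NumPy setup are all assumptions chosen only to show how a per-shot phase shift could keep shots temporally ordered yet clearly separated before a standard RoPE rotation is applied.

```python
# Minimal sketch (not the paper's implementation): standard 1-D RoPE on frame
# indices, plus a hypothetical per-shot offset "delta" added at each shot
# transition, so frames in later shots occupy clearly separated positions
# while the overall temporal order is preserved.
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Return rotation angles for each (position, frequency) pair."""
    freqs = 1.0 / (base ** (np.arange(0, dim, 2) / dim))   # (dim/2,)
    return np.outer(positions, freqs)                      # (T, dim/2)

def apply_rope(x, angles):
    """Rotate feature pairs of x (T, dim) by the given angles (T, dim/2)."""
    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def multishot_positions(frames_per_shot, delta=32.0):
    """Assign temporal positions with an extra offset `delta` (hypothetical
    value) at every shot boundary: shots stay ordered but well separated."""
    positions, t = [], 0.0
    for shot_idx, n_frames in enumerate(frames_per_shot):
        positions.extend(t + shot_idx * delta + np.arange(n_frames))
        t += n_frames
    return np.array(positions)

if __name__ == "__main__":
    dim, shots = 64, [16, 24, 12]          # three shots of different lengths
    pos = multishot_positions(shots)       # shifted temporal indices
    x = np.random.randn(len(pos), dim)     # stand-in for per-frame tokens
    x_rot = apply_rope(x, rope_angles(pos, dim))
    print(x_rot.shape)                     # (52, 64)
```

    A larger `delta` pushes shots further apart in position space while leaving within-shot continuity untouched; the paper applies its phase shift inside the pretrained model's rotary embeddings, and the exact offset scheme is not reproduced here.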

    Community

    Paper author Paper submitter

    The first controllable multi-shot video generation framework that supports text-driven inter-shot consistency, customized subject with motion control, and background-driven customized scene. Both shot counts and shot durations are variable.

    Seems like it's only one step away from being SORA 2. Honestly, if I had the compute, I'd love to work on this. Add just a bit more multimodal conditioning, for example by using VACE and maybe MultiTalk components to serve as your base model, and you'd have a model that can generate from scratch with audio capability, plus the ability to restyle/edit videos. There are also plenty of optimizations for auto-regressive generation that could be used for better speedups. So much good work that could be done.

    Paper author

    Thank you for your interest and insightful comments! We have the same vision as well. There are some new research directions in the multi-shot setting:

    1. Extending controllable functionalities (such as multimodal conditioning and style transfer) from single-shot to multi-shot settings.
    2. Integrating audio capabilities for multi-shot conversation, as in MoCha (https://congwei1230.github.io/MoCha/).
    3. Auto-regressive next-shot generation, as in Cut2Next (https://vchitect.github.io/Cut2Next-project/), LCT (https://guoyww.github.io/projects/long-context-video/), and Mask2DiT (https://tianhao-qi.github.io/Mask2DiTProject/).

    The fundamental challenges include efficient implementations, data curation, and computing power.

    This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

    The following papers were recommended by the Semantic Scholar API:

    • HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives (https://huggingface.co/papers/2510.20822) (2025)
    • MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation (https://huggingface.co/papers/2510.18692) (2025)
    • VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning (https://huggingface.co/papers/2510.08555) (2025)
    • InstanceV: Instance-Level Video Generation (https://huggingface.co/papers/2511.23146) (2025)
    • TempoMaster: Efficient Long Video Generation via Next-Frame-Rate Prediction (https://huggingface.co/papers/2511.12578) (2025)
    • TGT: Text-Grounded Trajectories for Locally Controlled Video Generation (https://huggingface.co/papers/2510.15104) (2025)
    • Video-As-Prompt: Unified Semantic Control for Video Generation (https://huggingface.co/papers/2510.20888) (2025)

    Please give a thumbs up to this comment if you found it helpful!

    If you want recommendations for any Paper on Hugging Face checkout this Space

    You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


    Models citing this paper 1

    Datasets citing this paper 0

    No dataset linking this paper

    Cite arxiv.org/abs/2512.03041 in a dataset README.md to link it from this page.

    Spaces citing this paper 0

    No Space linking this paper

    Cite arxiv.org/abs/2512.03041 in a Space README.md to link it from this page.

    Collections including this paper 3