Project page: http://geometrylearning.com/SketchVideo/
Code: https://github.com/IGLICT/SketchVideo
Video: https://www.youtube.com/watch?v=eo5DNiaGgiQ
\n","updatedAt":"2025-04-01T03:49:10.120Z","author":{"_id":"6424538b9f9e65b42389920e","avatarUrl":"/avatars/9b912e2af9eebe9a481181f006765059.svg","fullname":"Feng-Lin Liu","name":"Okrin","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":2,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.3975258469581604},"editors":["Okrin"],"editorAvatarUrls":["/avatars/9b912e2af9eebe9a481181f006765059.svg"],"reactions":[],"isReport":false}},{"id":"67eb8688c73d98e5ceb5fd3d","author":{"_id":"62f5c30762d21b23a9592978","avatarUrl":"/avatars/8918dd267e77169eaacfa92a21652753.svg","fullname":"yang","name":"jie123","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"isUserFollowing":false},"createdAt":"2025-04-01T06:24:08.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"really cool","html":"
jie123: really cool
\n","updatedAt":"2025-04-01T06:24:08.706Z","author":{"_id":"62f5c30762d21b23a9592978","avatarUrl":"/avatars/8918dd267e77169eaacfa92a21652753.svg","fullname":"yang","name":"jie123","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.9870531558990479},"editors":["jie123"],"editorAvatarUrls":["/avatars/8918dd267e77169eaacfa92a21652753.svg"],"reactions":[],"isReport":false}},{"id":"67ec94220140faafbe5430d4","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2025-04-02T01:34:26.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [VACE: All-in-One Video Creation and Editing](https://huggingface.co/papers/2503.07598) (2025)\n* [Resource-Efficient Motion Control for Video Generation via Dynamic Mask Guidance](https://huggingface.co/papers/2503.18386) (2025)\n* [Get In Video: Add Anything You Want to the Video](https://huggingface.co/papers/2503.06268) (2025)\n* [I2V3D: Controllable image-to-video generation with 3D guidance](https://huggingface.co/papers/2503.09733) (2025)\n* [DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation](https://huggingface.co/papers/2503.06053) (2025)\n* [HumanDiT: Pose-Guided Diffusion Transformer for Long-form Human Motion Video Generation](https://huggingface.co/papers/2502.04847) (2025)\n* [DreamInsert: Zero-Shot Image-to-Video Object Insertion from A Single Image](https://huggingface.co/papers/2503.10342) (2025)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"
This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). The following papers, recommended by the Semantic Scholar API, are similar to this paper:

* [VACE: All-in-One Video Creation and Editing](https://huggingface.co/papers/2503.07598) (2025)
* [Resource-Efficient Motion Control for Video Generation via Dynamic Mask Guidance](https://huggingface.co/papers/2503.18386) (2025)
* [Get In Video: Add Anything You Want to the Video](https://huggingface.co/papers/2503.06268) (2025)
* [I2V3D: Controllable image-to-video generation with 3D guidance](https://huggingface.co/papers/2503.09733) (2025)
* [DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation](https://huggingface.co/papers/2503.06053) (2025)
* [HumanDiT: Pose-Guided Diffusion Transformer for Long-form Human Motion Video Generation](https://huggingface.co/papers/2502.04847) (2025)
* [DreamInsert: Zero-Shot Image-to-Video Object Insertion from A Single Image](https://huggingface.co/papers/2503.10342) (2025)
\n","updatedAt":"2025-04-02T01:34:26.932Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6838790774345398},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2503.23284","authors":[{"_id":"67eb5280aeab4ce97de07134","user":{"_id":"6424538b9f9e65b42389920e","avatarUrl":"/avatars/9b912e2af9eebe9a481181f006765059.svg","isPro":false,"fullname":"Feng-Lin Liu","user":"Okrin","type":"user"},"name":"Feng-Lin Liu","status":"claimed_verified","statusLastChangedAt":"2025-04-01T07:47:05.907Z","hidden":false},{"_id":"67eb5280aeab4ce97de07135","user":{"_id":"662cd8b9322afcbae53fb06e","avatarUrl":"/avatars/9847f5c2282d49e61e76a0a303e0b2b1.svg","isPro":false,"fullname":"fuhongbo","user":"fuhongbo","type":"user"},"name":"Hongbo Fu","status":"admin_assigned","statusLastChangedAt":"2025-04-01T08:07:01.616Z","hidden":false},{"_id":"67eb5280aeab4ce97de07136","user":{"_id":"60e272ca6c78a8c122b12127","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/60e272ca6c78a8c122b12127/xldEGBzGrU-bX6IwAw0Ie.jpeg","isPro":false,"fullname":"Xintao Wang","user":"Xintao","type":"user"},"name":"Xintao Wang","status":"admin_assigned","statusLastChangedAt":"2025-04-01T08:07:09.910Z","hidden":false},{"_id":"67eb5280aeab4ce97de07137","user":{"_id":"6360d9f0472131c3bc4f61df","avatarUrl":"/avatars/c5d884e5ef19b781e3405aba6dd68ca8.svg","isPro":false,"fullname":"WeicaiYe","user":"WeicaiYe","type":"user"},"name":"Weicai Ye","status":"admin_assigned","statusLastChangedAt":"2025-04-01T08:07:16.323Z","hidden":false},{"_id":"67eb5280aeab4ce97de07138","user":{"_id":"662f93942510ef5735d7ad00","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/662f93942510ef5735d7ad00/ZIDIPm63sncIHFTT5b0uR.png","isPro":false,"fullname":"magicwpf","user":"magicwpf","type":"user"},"name":"Pengfei Wan","status":"claimed_verified","statusLastChangedAt":"2025-04-03T08:29:02.682Z","hidden":false},{"_id":"67eb5280aeab4ce97de07139","user":{"_id":"644c8324f02250233d0d67d9","avatarUrl":"/avatars/feb39d281457c1750f3eada3c060a23e.svg","isPro":false,"fullname":"Di Zhang","user":"dizhang","type":"user"},"name":"Di Zhang","status":"admin_assigned","statusLastChangedAt":"2025-04-01T08:07:30.236Z","hidden":false},{"_id":"67eb5280aeab4ce97de0713a","name":"Lin Gao","hidden":false}],"publishedAt":"2025-03-30T02:44:09.000Z","submittedOnDailyAt":"2025-04-01T02:19:10.110Z","title":"SketchVideo: Sketch-based Video Generation and Editing","submittedOnDailyBy":{"_id":"6424538b9f9e65b42389920e","avatarUrl":"/avatars/9b912e2af9eebe9a481181f006765059.svg","isPro":false,"fullname":"Feng-Lin Liu","user":"Okrin","type":"user"},"summary":"Video generation and editing conditioned on text prompts or images have\nundergone significant advancements. However, challenges remain in accurately\ncontrolling global layout and geometry details solely by texts, and supporting\nmotion control and local modification through images. 
In this paper, we aim to\nachieve sketch-based spatial and motion control for video generation and\nsupport fine-grained editing of real or synthetic videos. Based on the DiT\nvideo generation model, we propose a memory-efficient control structure with\nsketch control blocks that predict residual features of skipped DiT blocks.\nSketches are drawn on one or two keyframes (at arbitrary time points) for easy\ninteraction. To propagate such temporally sparse sketch conditions across all\nframes, we propose an inter-frame attention mechanism to analyze the\nrelationship between the keyframes and each video frame. For sketch-based video\nediting, we design an additional video insertion module that maintains\nconsistency between the newly edited content and the original video's spatial\nfeature and dynamic motion. During inference, we use latent fusion for the\naccurate preservation of unedited regions. Extensive experiments demonstrate\nthat our SketchVideo achieves superior performance in controllable video\ngeneration and editing.","upvotes":23,"discussionId":"67eb5286aeab4ce97de07320","projectPage":"http://geometrylearning.com/SketchVideo/","githubRepo":"https://github.com/IGLICT/SketchVideo","githubRepoAddedBy":"user","ai_summary":"SketchVideo enhances video generation and editing by incorporating sketch control blocks, inter-frame attention, and latent fusion, improving global layout, motion control, and consistency.","ai_keywords":["DiT","memory-efficient control structure","sketch control blocks","keyframes","inter-frame attention","video insertion module","latent fusion","controllable video generation","video editing"],"githubStars":100},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6424538b9f9e65b42389920e","avatarUrl":"/avatars/9b912e2af9eebe9a481181f006765059.svg","isPro":false,"fullname":"Feng-Lin Liu","user":"Okrin","type":"user"},{"_id":"65655763995cc49553e0f40a","avatarUrl":"/avatars/b76fb22fb59a10da134d2f317f49ccf2.svg","isPro":false,"fullname":"lingxiaozhang","user":"jwzxgy2007","type":"user"},{"_id":"6697379f776578cbfef1d4e4","avatarUrl":"/avatars/dabb9cc40aa4cd42cb9832bdd7cf90a0.svg","isPro":false,"fullname":"Wang Yichen","user":"yichen191","type":"user"},{"_id":"641bc8321911d3be674779d3","avatarUrl":"/avatars/34fd49ad6b939158a784242cc6223c9b.svg","isPro":false,"fullname":"SK","user":"skobalt","type":"user"},{"_id":"62f5c30762d21b23a9592978","avatarUrl":"/avatars/8918dd267e77169eaacfa92a21652753.svg","isPro":false,"fullname":"yang","user":"jie123","type":"user"},{"_id":"64ddf3135e19298505422d30","avatarUrl":"/avatars/7bca8f429c07a6fe70d24cc418961029.svg","isPro":false,"fullname":"leo 
lv","user":"sagileo","type":"user"},{"_id":"662f93942510ef5735d7ad00","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/662f93942510ef5735d7ad00/ZIDIPm63sncIHFTT5b0uR.png","isPro":false,"fullname":"magicwpf","user":"magicwpf","type":"user"},{"_id":"648eb1eb59c4e5c87dc116e0","avatarUrl":"/avatars/c636cea39c2c0937f01398c94ead5dad.svg","isPro":false,"fullname":"fdsqefsgergd","user":"T-representer","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"634dffc49b777beec3bc6448","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1670144568552-634dffc49b777beec3bc6448.jpeg","isPro":false,"fullname":"Zhipeng Yang","user":"svjack","type":"user"},{"_id":"63d4c8ce13ae45b780792f32","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63d4c8ce13ae45b780792f32/QasegimoxBqfZwDzorukz.png","isPro":false,"fullname":"Ohenenoo","user":"PeepDaSlan9","type":"user"},{"_id":"665b133508d536a8ac804f7d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/Uwi0OnANdTbRbHHQvGqvR.png","isPro":false,"fullname":"Paulson","user":"Pnaomi","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
SketchVideo: Sketch-based Video Generation and Editing
Feng-Lin Liu, Hongbo Fu, Xintao Wang, Weicai Ye, Pengfei Wan, Di Zhang, Lin Gao
Published on Mar 30, 2025
Abstract
AI summary: SketchVideo enhances video generation and editing by incorporating sketch control blocks, inter-frame attention, and latent fusion, improving global layout, motion control, and consistency.
Video generation and editing conditioned on text prompts or images have undergone significant advancements. However, challenges remain in accurately controlling global layout and geometry details through text alone, and in supporting motion control and local modification through images. In this paper, we aim to achieve sketch-based spatial and motion control for video generation and to support fine-grained editing of real or synthetic videos. Based on the DiT video generation model, we propose a memory-efficient control structure with sketch control blocks that predict residual features of skipped DiT blocks.
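To make the residual-control idea concrete, here is a minimal PyTorch sketch of one plausible wiring, assuming a ControlNet-style design in which trainable copies of selected DiT blocks consume sketch features and feed zero-initialized residuals back into a frozen backbone. All names (`SketchControlBlock`, `backbone_with_control`, `sketch_feat`) and the stand-in transformer layers are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class SketchControlBlock(nn.Module):
    """Trainable copy of a DiT block plus a zero-initialized output projection."""
    def __init__(self, dit_block: nn.Module, hidden_dim: int):
        super().__init__()
        self.block = dit_block                # trainable copy; the backbone stays frozen
        self.proj = nn.Linear(hidden_dim, hidden_dim)
        nn.init.zeros_(self.proj.weight)      # residual starts at zero, so training
        nn.init.zeros_(self.proj.bias)        # begins from the unmodified backbone

    def forward(self, h: torch.Tensor, sketch_feat: torch.Tensor) -> torch.Tensor:
        # Inject the encoded sketch condition and emit a residual feature.
        return self.proj(self.block(h + sketch_feat))

def backbone_with_control(dit_blocks, control_blocks, h, sketch_feat, skip=2):
    """Run the backbone; only every `skip`-th block receives a control residual,
    which keeps the number of trainable control blocks (and memory) small."""
    for i, block in enumerate(dit_blocks):
        h = block(h)
        if i % skip == 0:
            h = h + control_blocks[i // skip](h, sketch_feat)
    return h

# Toy usage with stand-in transformer layers in place of real DiT blocks:
dit = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True) for _ in range(4))
ctrl = nn.ModuleList(
    SketchControlBlock(nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), 64)
    for _ in range(2))
h = torch.randn(1, 16, 64)        # (batch, video tokens, channels)
sketch = torch.randn(1, 16, 64)   # encoded keyframe sketch features
out = backbone_with_control(dit, ctrl, h, sketch)  # (1, 16, 64)
```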
Sketches are drawn on one or two keyframes (at arbitrary time points) for easy interaction. To propagate such temporally sparse sketch conditions across all frames, we propose an inter-frame attention mechanism to analyze the relationship between the keyframes and each video frame.
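A hedged sketch of the inter-frame attention idea follows: every frame's tokens query the sketched keyframes' tokens, so a condition drawn on one or two frames can influence all frames. This is one plausible reading of the abstract, not the paper's exact formulation; the class name, shapes, and token layout are assumptions.

```python
import torch
import torch.nn as nn

class InterFrameAttention(nn.Module):
    """Each frame's tokens attend to the sketched keyframes' tokens."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frames: torch.Tensor, keyframes: torch.Tensor) -> torch.Tensor:
        # frames: (B*F, N, C) tokens of every frame; keyframes: (B*F, K*N, C)
        # keyframe tokens tiled per frame. The attention map captures the
        # correspondence between each frame and the keyframes, which governs
        # how the temporally sparse condition is spread over time.
        out, _ = self.attn(query=frames, key=keyframes, value=keyframes)
        return frames + out  # residual update preserves the frame content

attn = InterFrameAttention(dim=64)
frame_tokens = torch.randn(8, 16, 64)      # 8 frames x 16 tokens each
key_tokens = torch.randn(8, 32, 64)        # 2 keyframes x 16 tokens, tiled per frame
updated = attn(frame_tokens, key_tokens)   # (8, 16, 64)
```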
For sketch-based video editing, we design an additional video insertion module that maintains consistency between the newly edited content and the original video's spatial features and dynamic motion. During inference, we use latent fusion to accurately preserve unedited regions. Extensive experiments demonstrate that our SketchVideo achieves superior performance in controllable video generation and editing.
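As a final illustration, here is a minimal sketch of the latent-fusion step described above, assuming a mask-based blend applied in latent space (plausibly at each denoising step): outside the edited region the original video's latents are kept exactly, inside it the newly generated latents are used. The function name, mask convention, and tensor shapes are assumptions for illustration.

```python
import torch

def latent_fusion(orig_latent: torch.Tensor,
                  edit_latent: torch.Tensor,
                  edit_mask: torch.Tensor) -> torch.Tensor:
    """orig/edit latents: (B, C, F, H, W); edit_mask: (B, 1, F, H, W) in [0, 1].
    Where the mask is 0 the original latent passes through unchanged, so
    unedited regions are preserved; where it is 1 the edited latent is used."""
    return edit_mask * edit_latent + (1.0 - edit_mask) * orig_latent
```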