Efficient Autoregressive Video Diffusion with Dummy Head
arXiv: 2601.20499
Authors: Hang Guo, Zhaoyang Jia, Jiahao Li, Bin Li, Yuanhao Cai, Jiangshan Wang, Yawei Li, Yan Lu
Organization: Microsoft Research
Published: 2026-01-28
Project page: https://csguoh.github.io/project/DummyForcing/
GitHub: https://github.com/csguoh/DummyForcing
AI-generated summary
Autoregressive video diffusion models suffer from inefficient attention mechanisms that underutilize historical frames, but a new method called Dummy Forcing improves efficiency through heterogeneous memory allocation and dynamic head programming while maintaining quality.
The autoregressive video diffusion model has recently gained considerable research interest due to its causal modeling and iterative denoising. In this work, we identify that the multi-head self-attention in these models under-utilizes historical frames: approximately 25% of heads attend almost exclusively to the current frame, and discarding their KV caches incurs only minor performance degradation. Building upon this, we propose Dummy Forcing, a simple yet effective method to control context accessibility across different heads. Specifically, the proposed heterogeneous memory allocation reduces head-wise context redundancy, accompanied by dynamic head programming to adaptively classify head types. Moreover, we develop a context packing technique to achieve more aggressive cache compression. Without additional training, our Dummy Forcing delivers up to 2.0x speedup over the baseline, supporting video generation at 24.3 FPS with less than 0.5% quality drop. Project page is available at https://csguoh.github.io/project/DummyForcing/.
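The core observation above — that some heads place almost all of their attention mass on the current frame — can be sketched as a simple diagnostic. This is a minimal illustration, not the paper's implementation: the function name, the 0.9 threshold, and the assumption that current-frame tokens occupy the last key positions are all hypothetical choices for the example.

```python
import torch

def find_dummy_heads(attn_weights: torch.Tensor,
                     current_frame_len: int,
                     threshold: float = 0.9) -> torch.Tensor:
    """Flag heads whose attention mass falls almost entirely on the
    current frame's tokens.

    attn_weights: (num_heads, query_len, key_len) softmaxed attention,
        where the last `current_frame_len` key positions belong to the
        frame being denoised and the rest are cached history.
    Returns a boolean mask of shape (num_heads,), True for "dummy" heads.
    """
    # Per head: mass each query places on current-frame keys, then
    # average over queries -> shape (num_heads,)
    current_mass = attn_weights[..., -current_frame_len:].sum(dim=-1).mean(dim=-1)
    return current_mass >= threshold
```

A head flagged this way contributes little through its historical KV cache, which is why (per the abstract) discarding those caches costs only minor quality.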
Dummy Forcing is built on the observation that about 25% of the attention heads in existing autoregressive video diffusion models are "dummy": they attend almost exclusively to the current frame despite having access to historical context. Based on this observation, Dummy Forcing develops a technique to automatically identify dummy heads and allocate a varying amount of context per head. Leveraging this "dummy property" enables:
1. Efficient video generation at 24.3 FPS real-time speed.
2. High-resolution video generation, supporting 720P and 1080P with a 2.0x speedup.
3. Long-context video generation, enlarging the context window by 6.58x without losing efficiency.
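Allocating varying context per head can be pictured as a heterogeneous KV cache: dummy heads retain only the current frame's entries while the remaining heads keep full history. The sketch below is a hedged illustration under that assumption — the function name and the per-head list layout are ours, not the repository's API.

```python
import torch

def compress_kv_cache(k_cache: torch.Tensor,
                      v_cache: torch.Tensor,
                      dummy_mask: torch.Tensor,
                      current_frame_len: int):
    """Heterogeneous per-head cache: dummy heads keep only the current
    frame's KV entries; other heads keep the full history.

    k_cache, v_cache: (num_heads, seq_len, head_dim)
    dummy_mask: (num_heads,) boolean, True for dummy heads.
    Returns a per-head list of (k, v) pairs, since the retained
    sequence lengths now differ across heads.
    """
    compressed = []
    for h in range(k_cache.shape[0]):
        if dummy_mask[h]:
            # Drop historical entries for this head entirely.
            k = k_cache[h, -current_frame_len:]
            v = v_cache[h, -current_frame_len:]
        else:
            k, v = k_cache[h], v_cache[h]
        compressed.append((k, v))
    return compressed
```

With a quarter of heads compressed this way, the cache footprint and attention cost shrink roughly in proportion, which is the kind of saving behind the reported 2.0x speedup.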