Fast Video Generation with Sliding Tile Attention
Peiyuan Zhang, Yongqi Chen, Runlong Su, Hangliang Ding, Ion Stoica, Zhenghong Liu, Hao Zhang
arXiv:2502.04507 · Published 2025-02-06
Sliding tile attention (STA) accelerates 3D attention in Diffusion Transformers (DiTs), reducing video generation latency without quality loss.
AI-generated summary
Diffusion Transformers (DiTs) with 3D full attention power state-of-the-art video generation, but suffer from prohibitive compute cost: when generating just a 5-second 720P video, attention alone takes 800 out of 945 seconds of total inference time. This paper introduces sliding tile attention (STA) to address this challenge. STA leverages the observation that attention scores in pretrained video diffusion models predominantly concentrate within localized 3D windows. By sliding and attending over the local spatio-temporal region, STA eliminates redundancy from full attention. Unlike traditional token-wise sliding window attention (SWA), STA operates tile-by-tile with a novel hardware-aware sliding window design, preserving expressiveness while being hardware-efficient. With careful kernel-level optimizations, STA offers the first efficient 2D/3D sliding-window-like attention implementation, achieving 58.79% MFU. Specifically, STA accelerates attention by 2.8-17x over FlashAttention-2 (FA2) and 1.6-10x over FlashAttention-3 (FA3). On the leading video DiT, HunyuanVideo, STA reduces end-to-end latency from 945s (FA3) to 685s without quality degradation, requiring no training. Enabling finetuning further lowers latency to 268s with only a 0.09% drop on VBench.
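To make the tile-vs-token distinction concrete, the sketch below (not from the paper; a simplified 1D analogue of its 3D scheme) builds attention masks for token-wise SWA and for a tile-wise sliding window. In the tile-wise variant, each query tile attends to all tokens of its neighboring key tiles, so every attended block is fully dense — the property that lets a FlashAttention-style kernel skip or keep whole tiles instead of computing ragged diagonal bands. The function names and parameters (`tile`, `w_tiles`) are illustrative, not the paper's API.

```python
import numpy as np

def token_swa_mask(n: int, w: int) -> np.ndarray:
    """Token-wise sliding window: query i attends to keys within +/- w tokens.
    The resulting band mask is ragged with respect to any tiling."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w

def sliding_tile_mask(n: int, tile: int, w_tiles: int) -> np.ndarray:
    """Tile-wise sliding window (1D sketch of the STA idea): each query *tile*
    attends to the key tiles within +/- w_tiles, so every attended region is a
    dense tile x tile block -- hardware-friendly for tiled attention kernels."""
    assert n % tile == 0, "sequence length must be divisible by the tile size"
    t = n // tile
    tid = np.arange(t)
    tile_mask = np.abs(tid[:, None] - tid[None, :]) <= w_tiles
    # Expand the t x t tile-level mask to the full n x n token-level mask.
    return np.kron(tile_mask, np.ones((tile, tile), dtype=bool)).astype(bool)

# Example: 16 tokens, tiles of 4, each tile attends to itself and one
# neighboring tile on each side.
mask = sliding_tile_mask(16, tile=4, w_tiles=1)
```

A kernel consuming `mask` only needs a per-tile keep/skip decision, whereas the token-wise band mask forces partial-tile computation along its edges; the paper's 3D version applies the same idea over (time, height, width) tiles.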