Paper page - Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention

Project page: https://dvirsamuel.github.io/fast-auto-regressive-video/

\n","updatedAt":"2026-02-03T14:38:31.307Z","author":{"_id":"630f0d48982455e61cc4cc08","avatarUrl":"/avatars/eea6ed2e112e830effa98a4661c5474f.svg","fullname":"Samuel","name":"Dvir","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.4434092044830322},"editors":["Dvir"],"editorAvatarUrls":["/avatars/eea6ed2e112e830effa98a4661c5474f.svg"],"reactions":[],"isReport":false}},{"id":"6982a2ffabb981a1ac9395ea","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2026-02-04T01:38:07.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [Past- and Future-Informed KV Cache Policy with Salience Estimation in Autoregressive Video Diffusion](https://huggingface.co/papers/2601.21896) (2026)\n* [Efficient Autoregressive Video Diffusion with Dummy Head](https://huggingface.co/papers/2601.20499) (2026)\n* [PackCache: A Training-Free Acceleration Method for Unified Autoregressive Video Generation via Compact KV-Cache](https://huggingface.co/papers/2601.04359) (2026)\n* [HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming](https://huggingface.co/papers/2512.21338) (2025)\n* [Window-Diffusion: Accelerating Diffusion Language Model Inference with Windowed Token Pruning and Caching](https://huggingface.co/papers/2601.20332) (2026)\n* [MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives](https://huggingface.co/papers/2512.14699) (2025)\n* [VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding](https://huggingface.co/papers/2601.17868) (2026)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

\n

The following papers were recommended by the Semantic Scholar API

\n\n

Please give a thumbs up to this comment if you found it helpful!

\n

If you want recommendations for any Paper on Hugging Face checkout this Space

\n

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend

\n","updatedAt":"2026-02-04T01:38:07.889Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6796765923500061},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2602.01801","authors":[{"_id":"698208539c2f139721ec3495","name":"Dvir Samuel","hidden":false},{"_id":"698208539c2f139721ec3496","user":{"_id":"66c704d5c797952bc2360ecb","avatarUrl":"/avatars/11e999a17c043c100571d8df0d966fdf.svg","isPro":false,"fullname":"issart","user":"issart12345","type":"user"},"name":"Issar Tzachor","status":"claimed_verified","statusLastChangedAt":"2026-02-09T08:36:10.787Z","hidden":false},{"_id":"698208539c2f139721ec3497","user":{"_id":"630488f55d136debceca5bdd","avatarUrl":"/avatars/7eb5db358992648198e6f566b98681e8.svg","isPro":false,"fullname":"Matan Levy","user":"Matanl","type":"user"},"name":"Matan Levy","status":"claimed_verified","statusLastChangedAt":"2026-02-04T12:31:33.514Z","hidden":false},{"_id":"698208539c2f139721ec3498","name":"Micahel Green","hidden":false},{"_id":"698208539c2f139721ec3499","name":"Gal Chechik","hidden":false},{"_id":"698208539c2f139721ec349a","name":"Rami Ben-Ari","hidden":false}],"publishedAt":"2026-02-02T08:31:21.000Z","submittedOnDailyAt":"2026-02-03T12:08:31.298Z","title":"Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention","submittedOnDailyBy":{"_id":"630f0d48982455e61cc4cc08","avatarUrl":"/avatars/eea6ed2e112e830effa98a4661c5474f.svg","isPro":false,"fullname":"Samuel","user":"Dvir","type":"user"},"summary":"Autoregressive video diffusion models enable streaming generation, opening the door to long-form synthesis, video world models, and interactive neural game engines. However, their core attention layers become a major bottleneck at inference time: as generation progresses, the KV cache grows, causing both increasing latency and escalating GPU memory, which in turn restricts usable temporal context and harms long-range consistency. In this work, we study redundancy in autoregressive video diffusion and identify three persistent sources: near-duplicate cached keys across frames, slowly evolving (largely semantic) queries/keys that make many attention computations redundant, and cross-attention over long prompts where only a small subset of tokens matters per frame. Building on these observations, we propose a unified, training-free attention framework for autoregressive diffusion: TempCache compresses the KV cache via temporal correspondence to bound cache growth; AnnCA accelerates cross-attention by selecting frame-relevant prompt tokens using fast approximate nearest neighbor (ANN) matching; and AnnSA sparsifies self-attention by restricting each query to semantically matched keys, also using a lightweight ANN. Together, these modules reduce attention, compute, and memory and are compatible with existing autoregressive diffusion backbones and world models. 
Experiments demonstrate up to x5--x10 end-to-end speedups while preserving near-identical visual quality and, crucially, maintaining stable throughput and nearly constant peak GPU memory usage over long rollouts, where prior methods progressively slow down and suffer from increasing memory usage.","upvotes":28,"discussionId":"698208539c2f139721ec349b","projectPage":"https://dvirsamuel.github.io/fast-auto-regressive-video/","ai_summary":"Autoregressive video diffusion models face efficiency challenges due to growing KV caches and redundant attention computations, which are addressed through TempCache, AnnCA, and AnnSA techniques that reduce computational demands while maintaining visual quality and stable performance.","ai_keywords":["autoregressive video diffusion models","KV cache","attention layers","temporal correspondence","cross-attention","self-attention","fast approximate nearest neighbor","ANN","temporal redundancy","semantic queries","frame-relevant tokens"]},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"630f0d48982455e61cc4cc08","avatarUrl":"/avatars/eea6ed2e112e830effa98a4661c5474f.svg","isPro":false,"fullname":"Samuel","user":"Dvir","type":"user"},{"_id":"6849336d94f8a257886d8200","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/3FC9fZqe33w_T4-BWbeoC.png","isPro":false,"fullname":"Tal Shalev","user":"talko1989","type":"user"},{"_id":"66c498f2b9cc84906f82b99a","avatarUrl":"/avatars/3235996f5e61beb86866d325491d3d97.svg","isPro":false,"fullname":"Gavriel H","user":"gavrielh","type":"user"},{"_id":"689e23d2d310cc01ce25ed05","avatarUrl":"/avatars/c273c2e119d86583dbea5a0178968204.svg","isPro":false,"fullname":"Noa Barzilay","user":"NoaBarzilay","type":"user"},{"_id":"63c59c3a6d132b995fedface","avatarUrl":"/avatars/4e18b19e477cb683ce1ba3ae6ab77d8e.svg","isPro":false,"fullname":"Ohad rahamim","user":"ohad204","type":"user"},{"_id":"646d239f4220471ca0c6471c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/646d239f4220471ca0c6471c/sRwzko8XEUVCkeD7jXceH.jpeg","isPro":false,"fullname":"Guy Yariv","user":"GuyYariv","type":"user"},{"_id":"630488f55d136debceca5bdd","avatarUrl":"/avatars/7eb5db358992648198e6f566b98681e8.svg","isPro":false,"fullname":"Matan Levy","user":"Matanl","type":"user"},{"_id":"656865932f3ec92d7e86fca6","avatarUrl":"/avatars/93ffd2b01e640d55a3413b4a6bea69c0.svg","isPro":false,"fullname":"Or","user":"orshij","type":"user"},{"_id":"6310627fa84f681a1193b372","avatarUrl":"/avatars/53ac50eb70f8fcf8c255e9bc0207790a.svg","isPro":false,"fullname":"Yuval Atzmon","user":"yatzmon","type":"user"},{"_id":"632add454f9006565ae542be","avatarUrl":"/avatars/0937fd1690052f854a5fb6f80cfca696.svg","isPro":false,"fullname":"Omer Regev","user":"omeregev","type":"user"},{"_id":"6465fd33dac127ac80f0b334","avatarUrl":"/avatars/113f02c1b1f8d33d3487daa867afcd3f.svg","isPro":false,"fullname":"Jonathan Kahana","user":"jonkahana","type":"user"},{"_id":"65c43b8e61c8e6d06ab4bd41","avatarUrl":"/avatars/c97b98252ec3a0e27ea4e561fc901042.svg","isPro":false,"fullname":"NivCohen","user":"NivC","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
arxiv:2602.01801

Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention

Published on Feb 2, 2026
Submitted by Dvir Samuel on Feb 3, 2026
Authors: Dvir Samuel, Issar Tzachor, Matan Levy, Michael Green, Gal Chechik, Rami Ben-Ari

Abstract

AI-generated summary: Autoregressive video diffusion models face efficiency challenges due to growing KV caches and redundant attention computations, which are addressed through TempCache, AnnCA, and AnnSA techniques that reduce computational demands while maintaining visual quality and stable performance.

Autoregressive video diffusion models enable streaming generation, opening the door to long-form synthesis, video world models, and interactive neural game engines. However, their core attention layers become a major bottleneck at inference time: as generation progresses, the KV cache grows, causing both increasing latency and escalating GPU memory, which in turn restricts usable temporal context and harms long-range consistency. In this work, we study redundancy in autoregressive video diffusion and identify three persistent sources: near-duplicate cached keys across frames, slowly evolving (largely semantic) queries/keys that make many attention computations redundant, and cross-attention over long prompts where only a small subset of tokens matters per frame. Building on these observations, we propose a unified, training-free attention framework for autoregressive diffusion: TempCache compresses the KV cache via temporal correspondence to bound cache growth; AnnCA accelerates cross-attention by selecting frame-relevant prompt tokens using fast approximate nearest neighbor (ANN) matching; and AnnSA sparsifies self-attention by restricting each query to semantically matched keys, also using a lightweight ANN. Together, these modules reduce attention, compute, and memory and are compatible with existing autoregressive diffusion backbones and world models. Experiments demonstrate up to 5x–10x end-to-end speedups while preserving near-identical visual quality and, crucially, maintaining stable throughput and nearly constant peak GPU memory usage over long rollouts, where prior methods progressively slow down and suffer from increasing memory usage.
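
The abstract names three training-free modules (TempCache, AnnCA, AnnSA) but this page carries no code. Below is a minimal, self-contained PyTorch sketch of two of the underlying ideas only: dropping near-duplicate cached keys to bound KV-cache growth, and letting each query attend to a small set of best-matching keys. It is not the paper's implementation; the function names, the cosine-similarity threshold, and the exact top-k search used here in place of a real approximate-nearest-neighbor index are all illustrative assumptions.

```python
# Illustrative sketch only -- NOT the paper's TempCache/AnnSA code.
# Idea 1: skip caching keys that are near-duplicates of already-cached keys.
# Idea 2: restrict each query to its best-matching keys (stand-in for ANN matching).
import torch
import torch.nn.functional as F


def compress_kv_cache(cached_k, cached_v, new_k, new_v, sim_threshold=0.98):
    """Append only new key/value pairs whose keys are not near-duplicates
    (cosine similarity >= sim_threshold) of keys already in the cache."""
    if cached_k.numel() == 0:
        return new_k, new_v
    sims = F.normalize(new_k, dim=-1) @ F.normalize(cached_k, dim=-1).T  # (n_new, n_cached)
    keep = sims.max(dim=-1).values < sim_threshold  # keep only genuinely novel keys
    return (torch.cat([cached_k, new_k[keep]], dim=0),
            torch.cat([cached_v, new_v[keep]], dim=0))


def sparse_attention(q, k, v, top_k=16):
    """Each query attends only to its top_k highest-scoring keys; a real system
    would find these neighbors with a fast ANN index rather than exact top-k."""
    scores = (q @ k.T) / (k.shape[-1] ** 0.5)  # (n_q, n_k)
    top_k = min(top_k, k.shape[0])
    idx = scores.topk(top_k, dim=-1).indices
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(1, idx, 0.0)  # unmask only the selected keys
    return torch.softmax(scores + mask, dim=-1) @ v


# Toy rollout: consecutive frames share most of their keys (temporal redundancy),
# so the compressed cache grows far more slowly than the raw token count.
torch.manual_seed(0)
d = 64
cached_k, cached_v = torch.empty(0, d), torch.empty(0, d)
prev_keys = torch.randn(16, d)
for frame in range(5):
    # 16 keys barely change from the previous frame, 4 are genuinely new.
    new_k = torch.cat([prev_keys + 1e-3 * torch.randn(16, d), torch.randn(4, d)])
    new_v = torch.randn(20, d)
    cached_k, cached_v = compress_kv_cache(cached_k, cached_v, new_k, new_v)
    out = sparse_attention(torch.randn(8, d), cached_k, cached_v)
    print(f"frame {frame}: cache size = {cached_k.shape[0]} "
          f"(raw would be {20 * (frame + 1)}), attn out {tuple(out.shape)}")
    prev_keys = new_k[:16]
```

The same nearest-neighbor selection idea would carry over to cross-attention (the AnnCA component) by scoring prompt tokens against each frame's queries and keeping only the best-matching subset, but the actual matching and caching policies should be taken from the paper itself.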

Community

Paper submitter: Dvir Samuel

Librarian Bot (automated comment):

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

- Past- and Future-Informed KV Cache Policy with Salience Estimation in Autoregressive Video Diffusion (2026): https://huggingface.co/papers/2601.21896
- Efficient Autoregressive Video Diffusion with Dummy Head (2026): https://huggingface.co/papers/2601.20499
- PackCache: A Training-Free Acceleration Method for Unified Autoregressive Video Generation via Compact KV-Cache (2026): https://huggingface.co/papers/2601.04359
- HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming (2025): https://huggingface.co/papers/2512.21338
- Window-Diffusion: Accelerating Diffusion Language Model Inference with Windowed Token Pruning and Caching (2026): https://huggingface.co/papers/2601.20332
- MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives (2025): https://huggingface.co/papers/2512.14699
- VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding (2026): https://huggingface.co/papers/2601.17868

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out the recommend_similar_papers Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2602.01801 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2602.01801 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2602.01801 in a Space README.md to link it from this page.

Collections including this paper 3