Paper page - RepVideo: Rethinking Cross-Layer Representation for Video Generation
Project page: https://vchitect.github.io/RepVid-Webpage/
Code: https://github.com/Vchitect/RepVideo

Comment from librarian-bot:
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* [Multilevel Semantic-Aware Model for AI-Generated Video Quality Assessment](https://huggingface.co/papers/2501.02706) (2025)
* [BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations](https://huggingface.co/papers/2501.07647) (2025)
* [Optical-Flow Guided Prompt Optimization for Coherent Video Generation](https://huggingface.co/papers/2411.15540) (2024)
* [DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation](https://huggingface.co/papers/2412.18597) (2024)
* [Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation](https://huggingface.co/papers/2412.06016) (2024)
* [Training-Free Motion-Guided Video Generation with Enhanced Temporal Consistency Using Motion Consistency Loss](https://huggingface.co/papers/2501.07563) (2025)
* [SwiftTry: Fast and Consistent Video Virtual Try-On with Diffusion Models](https://huggingface.co/papers/2412.10178) (2024)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out [this Space](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers).

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`
\n","updatedAt":"2025-01-17T01:33:02.465Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6611007452011108},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}},{"id":"678d2471281a0e32fe8bdd0d","author":{"_id":"67871ea601caf476eca9da4e","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/SZXgvTnnxJ1u1YsXUOgcw.png","fullname":"Tony Hawk","name":"ZaBigChief","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"isUserFollowing":false},"createdAt":"2025-01-19T16:12:33.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"Is there anyone hosting this (similar to HuggingFace - does not have a demo) ?\nIf not - when will you host it on HuggingFace ?","html":"
Is anyone hosting this anywhere (there does not appear to be a demo on Hugging Face)? If not, when will you host it on Hugging Face?
\n","updatedAt":"2025-01-19T16:12:33.688Z","author":{"_id":"67871ea601caf476eca9da4e","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/SZXgvTnnxJ1u1YsXUOgcw.png","fullname":"Tony Hawk","name":"ZaBigChief","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.8408618569374084},"editors":["ZaBigChief"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/SZXgvTnnxJ1u1YsXUOgcw.png"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2501.08994","authors":[{"_id":"6788945e2b5050a9154d939d","user":{"_id":"635f8ed47c05eb9f59963d3a","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/635f8ed47c05eb9f59963d3a/uQf4p9N9pSaFy87Wg9v4k.jpeg","isPro":false,"fullname":"ChenyangSi","user":"ChenyangSi","type":"user"},"name":"Chenyang Si","status":"admin_assigned","statusLastChangedAt":"2025-01-16T08:49:14.615Z","hidden":false},{"_id":"6788945e2b5050a9154d939e","user":{"_id":"6481764e8af4675862efb22e","avatarUrl":"/avatars/fc2e076bc861693f598a528a068a696e.svg","isPro":false,"fullname":"weichenfan","user":"weepiess2383","type":"user"},"name":"Weichen Fan","status":"admin_assigned","statusLastChangedAt":"2025-01-16T08:49:38.866Z","hidden":false},{"_id":"6788945e2b5050a9154d939f","user":{"_id":"645aff5121ab438e732c47c1","avatarUrl":"/avatars/23b2a853139b0f2ae1fa88e2bd4e0056.svg","isPro":false,"fullname":"Zhengyao Lv","user":"cszy98","type":"user"},"name":"Zhengyao Lv","status":"admin_assigned","statusLastChangedAt":"2025-01-16T08:49:45.309Z","hidden":false},{"_id":"6788945e2b5050a9154d93a0","user":{"_id":"60efe7fa0d920bc7805cada5","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/60efe7fa0d920bc7805cada5/2LBrJBjSCOP5ilZIpWLHl.png","isPro":false,"fullname":"Ziqi Huang","user":"Ziqi","type":"user"},"name":"Ziqi Huang","status":"admin_assigned","statusLastChangedAt":"2025-01-16T08:49:51.123Z","hidden":false},{"_id":"6788945e2b5050a9154d93a1","name":"Yu Qiao","hidden":false},{"_id":"6788945e2b5050a9154d93a2","user":{"_id":"62ab1ac1d48b4d8b048a3473","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1656826685333-62ab1ac1d48b4d8b048a3473.png","isPro":false,"fullname":"Ziwei Liu","user":"liuziwei7","type":"user"},"name":"Ziwei Liu","status":"admin_assigned","statusLastChangedAt":"2025-01-16T08:49:58.339Z","hidden":false}],"publishedAt":"2025-01-15T18:20:37.000Z","submittedOnDailyAt":"2025-01-16T02:39:01.580Z","title":"RepVideo: Rethinking Cross-Layer Representation for Video Generation","submittedOnDailyBy":{"_id":"60f1abe7544c2adfd699860c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674929746905-60f1abe7544c2adfd699860c.jpeg","isPro":false,"fullname":"AK","user":"akhaliq","type":"user"},"summary":"Video generation has achieved remarkable progress with the introduction of\ndiffusion models, which have significantly improved the quality of generated\nvideos. However, recent research has primarily focused on scaling up model\ntraining, while offering limited insights into the direct impact of\nrepresentations on the video generation process. In this paper, we initially\ninvestigate the characteristics of features in intermediate layers, finding\nsubstantial variations in attention maps across different layers. 
These\nvariations lead to unstable semantic representations and contribute to\ncumulative differences between features, which ultimately reduce the similarity\nbetween adjacent frames and negatively affect temporal coherence. To address\nthis, we propose RepVideo, an enhanced representation framework for\ntext-to-video diffusion models. By accumulating features from neighboring\nlayers to form enriched representations, this approach captures more stable\nsemantic information. These enhanced representations are then used as inputs to\nthe attention mechanism, thereby improving semantic expressiveness while\nensuring feature consistency across adjacent frames. Extensive experiments\ndemonstrate that our RepVideo not only significantly enhances the ability to\ngenerate accurate spatial appearances, such as capturing complex spatial\nrelationships between multiple objects, but also improves temporal consistency\nin video generation.","upvotes":15,"discussionId":"678894602b5050a9154d945b","projectPage":"https://vchitect.github.io/RepVid-Webpage/","githubRepo":"https://github.com/Vchitect/RepVideo","githubRepoAddedBy":"auto","ai_summary":"RepVideo enhances video generation by stabilizing semantic representations through enriched feature accumulation, improving spatial accuracy and temporal coherence in text-to-video diffusion models.","ai_keywords":["diffusion models","attention maps","semantic representations","temporal coherence","RepVideo","feature accumulation","spatial appearances","temporal consistency"],"githubStars":124},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"645dbaa6f5760d1530d7580d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/645dbaa6f5760d1530d7580d/Bqob8arLZoHIgMwNZpL9I.jpeg","isPro":true,"fullname":"Simeon Emanuilov","user":"s-emanuilov","type":"user"},{"_id":"609653c1146ef3bfe2fc7392","avatarUrl":"/avatars/1639b6552a419209ae67b6562183bc2f.svg","isPro":false,"fullname":"Inui","user":"Norm","type":"user"},{"_id":"62ab1ac1d48b4d8b048a3473","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1656826685333-62ab1ac1d48b4d8b048a3473.png","isPro":false,"fullname":"Ziwei Liu","user":"liuziwei7","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"63054f9320668afe24865bba","avatarUrl":"/avatars/75962ffed33d38761bce6c947750e1e4.svg","isPro":false,"fullname":"KW","user":"kevineen","type":"user"},{"_id":"635f8ed47c05eb9f59963d3a","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/635f8ed47c05eb9f59963d3a/uQf4p9N9pSaFy87Wg9v4k.jpeg","isPro":false,"fullname":"ChenyangSi","user":"ChenyangSi","type":"user"},{"_id":"656ee8008bb9f4f8d95bd8f7","avatarUrl":"/avatars/4069d70f1279d928da521211c495d638.svg","isPro":false,"fullname":"Hyeonho Jeong","user":"hyeonho-jeong-video","type":"user"},{"_id":"643b19f8a856622f978df30f","avatarUrl":"/avatars/c82779fdf94f80cdb5020504f83c818b.svg","isPro":false,"fullname":"Yatharth Sharma","user":"YaTharThShaRma999","type":"user"},{"_id":"64e567c9ddbefb63095a9662","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/F2BwrOU0XpzVI5nd-TL54.png","isPro":false,"fullname":"Bullard 
","user":"Charletta1","type":"user"},{"_id":"648eb1eb59c4e5c87dc116e0","avatarUrl":"/avatars/c636cea39c2c0937f01398c94ead5dad.svg","isPro":false,"fullname":"fdsqefsgergd","user":"T-representer","type":"user"},{"_id":"6683fc5344a65be1aab25dc0","avatarUrl":"/avatars/e13cde3f87b59e418838d702807df3b5.svg","isPro":false,"fullname":"hjkim","user":"hojie11","type":"user"},{"_id":"663ccbff3a74a20189d4aa2e","avatarUrl":"/avatars/83a54455e0157480f65c498cd9057cf2.svg","isPro":false,"fullname":"Nguyen Van Thanh","user":"NguyenVanThanhHust","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
AI-generated summary
RepVideo enhances video generation by stabilizing semantic representations through enriched feature accumulation, improving spatial accuracy and temporal coherence in text-to-video diffusion models.

Abstract
Video generation has achieved remarkable progress with the introduction of
diffusion models, which have significantly improved the quality of generated
videos. However, recent research has primarily focused on scaling up model
training, while offering limited insights into the direct impact of
representations on the video generation process. In this paper, we initially
investigate the characteristics of features in intermediate layers, finding
substantial variations in attention maps across different layers. These
variations lead to unstable semantic representations and contribute to
cumulative differences between features, which ultimately reduce the similarity
between adjacent frames and negatively affect temporal coherence. To address
this, we propose RepVideo, an enhanced representation framework for
text-to-video diffusion models. By accumulating features from neighboring
layers to form enriched representations, this approach captures more stable
semantic information. These enhanced representations are then used as inputs to
the attention mechanism, thereby improving semantic expressiveness while
ensuring feature consistency across adjacent frames. Extensive experiments
demonstrate that our RepVideo not only significantly enhances the ability to
generate accurate spatial appearances, such as capturing complex spatial
relationships between multiple objects, but also improves temporal consistency
in video generation.
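
To make the cross-layer idea concrete, below is a minimal, hypothetical PyTorch sketch of accumulating hidden states from neighboring transformer layers into an enriched representation that is then fed back into the per-layer features before attention. The module name, the averaging window, and the learnable gate are illustrative assumptions, not the authors' implementation; see the official repository (https://github.com/Vchitect/RepVideo) for the actual code.

```python
import torch
import torch.nn as nn


class CrossLayerAggregator(nn.Module):
    """Toy sketch: average hidden states from the last few transformer layers
    and mix the result back into the current layer's features, so the attention
    that follows sees a more stable, enriched representation."""

    def __init__(self, hidden_dim: int, window: int = 4):
        super().__init__()
        self.window = window
        self.proj = nn.Linear(hidden_dim, hidden_dim)
        # Learnable gate controlling how strongly the aggregated context is mixed in
        # (an assumption for this sketch, not a detail taken from the paper).
        self.gate = nn.Parameter(torch.tensor(0.5))

    def forward(self, hidden: torch.Tensor, layer_cache: list) -> torch.Tensor:
        # hidden: (batch, tokens, dim) features of the current layer.
        # layer_cache: hidden states collected from the preceding layers.
        recent = layer_cache[-self.window:] + [hidden]
        aggregated = torch.stack(recent, dim=0).mean(dim=0)
        return hidden + self.gate * self.proj(aggregated)


if __name__ == "__main__":
    batch, tokens, dim = 2, 16, 64
    aggregator = CrossLayerAggregator(dim)
    cache, x = [], torch.randn(batch, tokens, dim)
    for _ in range(6):                      # stand-in for a stack of transformer blocks
        cache.append(x)
        x = x + 0.01 * torch.randn_like(x)  # placeholder for a real transformer layer
        x = aggregator(x, cache)            # enriched features passed to the next block's attention
    print(x.shape)                          # torch.Size([2, 16, 64])
```

The intent of the sketch is only to show the data flow the abstract describes: per-layer features are combined with their neighbors into a more stable representation before entering attention, rather than each layer's attention operating on its own, rapidly varying features.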