arxiv:2412.04448

MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation

Published on Dec 5, 2024 · Submitted by Longtao Zheng on Dec 6, 2024
Authors: Longtao Zheng, Yifan Zhang, Hanzhong Guo, Jiachun Pan, Zhenxiong Tan, Jiahao Lu, Chuanxin Tang, Bo An, Shuicheng Yan

Abstract

AI-generated summary: Memory-guided EMOtion-aware diffusion (MEMO) enhances the realism of audio-driven talking videos by ensuring audio-lip synchronization, long-term identity consistency, and natural expression alignment.

Recent advances in video diffusion models have unlocked new potential for realistic audio-driven talking video generation. However, achieving seamless audio-lip synchronization, maintaining long-term identity consistency, and producing natural, audio-aligned expressions in generated talking videos remain significant challenges. To address these challenges, we propose Memory-guided EMOtion-aware diffusion (MEMO), an end-to-end audio-driven portrait animation approach to generate identity-consistent and expressive talking videos. Our approach is built around two key modules: (1) a memory-guided temporal module, which enhances long-term identity consistency and motion smoothness by developing memory states to store information from a longer past context to guide temporal modeling via linear attention; and (2) an emotion-aware audio module, which replaces traditional cross attention with multi-modal attention to enhance audio-video interaction, while detecting emotions from audio to refine facial expressions via emotion adaptive layer norm. Extensive quantitative and qualitative results demonstrate that MEMO generates more realistic talking videos across diverse image and audio types, outperforming state-of-the-art methods in overall quality, audio-lip synchronization, identity consistency, and expression-emotion alignment.
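To make the two modules concrete, below is a minimal PyTorch sketch of the ideas the abstract describes; it is an illustration, not the authors' released code. It shows a chunk-recurrent linear-attention layer that carries a running key/value memory across generated video chunks, and an emotion-adaptive layer norm whose scale and shift are predicted from an audio-derived emotion embedding. All class names, tensor shapes, and the ELU feature map are assumptions made for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MemoryLinearAttention(nn.Module):
    """Sketch of a memory-guided temporal layer: linear attention whose
    key/value summary is accumulated across video chunks, so each new chunk
    can attend to a longer past context at constant cost per chunk."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, x, memory=None):
        # x: (batch, frames, dim); memory: ((batch, dim, dim), (batch, dim)) or None
        q = F.elu(self.to_q(x)) + 1.0  # positive feature map for linear attention
        k = F.elu(self.to_k(x)) + 1.0
        v = self.to_v(x)
        kv = torch.einsum("bnd,bne->bde", k, v)   # key/value summary of this chunk
        k_sum = k.sum(dim=1)                      # normaliser summary of this chunk
        if memory is not None:                    # fold in the stored past context
            past_kv, past_k = memory
            kv, k_sum = kv + past_kv, k_sum + past_k
        z = torch.einsum("bnd,bd->bn", q, k_sum).clamp(min=1e-6)
        out = torch.einsum("bnd,bde->bne", q, kv) / z.unsqueeze(-1)
        return out, (kv, k_sum)                   # updated memory guides the next chunk


class EmotionAdaLayerNorm(nn.Module):
    """Sketch of an emotion-adaptive layer norm: the scale and shift are
    predicted from an emotion embedding detected from the audio, letting the
    denoiser refine facial expressions toward the detected emotion."""

    def __init__(self, dim: int, emotion_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(emotion_dim, 2 * dim)

    def forward(self, x, emotion_emb):
        # x: (batch, tokens, dim); emotion_emb: (batch, emotion_dim)
        scale, shift = self.to_scale_shift(emotion_emb).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```

In such a setup, the caller would roll the memory returned for one chunk into the attention call for the next chunk, which is how the sketch approximates storing information from a longer past context, and would apply the emotion-adaptive norm inside the denoising blocks using an emotion embedding predicted from the audio.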

Community

Paper author · Paper submitter

MEMO is a state-of-the-art open-weight model for audio-driven talking video generation.

Project Page: https://memoavatar.github.io

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion (https://huggingface.co/papers/2411.16726) (2024)
* LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis (https://huggingface.co/papers/2411.16748) (2024)
* Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis (https://huggingface.co/papers/2411.19509) (2024)
* SINGER: Vivid Audio-driven Singing Video Generation with Multi-scale Spectral Diffusion Model (https://huggingface.co/papers/2412.03430) (2024)
* FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait (https://huggingface.co/papers/2412.01064) (2024)
* Takin-ADA: Emotion Controllable Audio-Driven Animation with Canonical and Landmark Loss Optimization (https://huggingface.co/papers/2410.14283) (2024)
* Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation (https://huggingface.co/papers/2410.07718) (2024)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper: 2

Datasets citing this paper: 1

Spaces citing this paper: 4

Collections including this paper: 4