Paper page - LivePhoto: Real Image Animation with Text-guided Motion Control
AI-generated summary: A text-to-video system, LivePhoto, uses a text-to-image generator enhanced with a motion module to accurately translate textual motion descriptions into videos.
Despite the recent progress in text-to-video generation, existing studies
usually overlook the issue that only spatial contents but not temporal motions
in synthesized videos are under the control of text. Towards such a challenge,
this work presents a practical system, named LivePhoto, which allows users to
animate an image of their interest with text descriptions. We first establish a
strong baseline that helps a well-learned text-to-image generator (i.e., Stable
Diffusion) take an image as a further input. We then equip the improved
generator with a motion module for temporal modeling and propose a carefully
designed training pipeline to better link texts and motions. In particular,
considering the facts that (1) text can only describe motions roughly (e.g.,
regardless of the moving speed) and (2) text may include both content and
motion descriptions, we introduce a motion intensity estimation module as well
as a text re-weighting module to reduce the ambiguity of text-to-motion
mapping. Empirical evidence suggests that our approach is capable of well
decoding motion-related textual instructions into videos, such as actions,
camera movements, or even conjuring new contents from thin air (e.g., pouring
water into an empty glass). Interestingly, thanks to the proposed intensity
learning mechanism, our system offers users an additional control signal (i.e.,
the motion intensity) besides text for video customization.
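The abstract describes two conditioning modules that disambiguate text-to-motion mapping: a text re-weighting module that emphasizes motion-describing tokens over content-describing ones, and a motion intensity signal the user can set alongside the prompt. The following is a minimal Python sketch of how those two signals might be packaged for the generator; the function names (`reweight_tokens`, `condition`) and the 10-level intensity scale are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def reweight_tokens(tokens, motion_scores):
    """Text re-weighting sketch: turn per-token motion-relevance scores
    (which in the paper would come from a learned module) into normalized
    weights, so motion words like 'running' dominate content words."""
    weights = softmax(motion_scores)
    return list(zip(tokens, weights))


def condition(image_latent, weighted_text, intensity):
    """Bundle the three control signals the abstract mentions: the input
    image, the re-weighted text, and a user-set motion intensity.
    The 1-10 range is an assumption for illustration; it is clamped so
    out-of-range user input stays a valid control value."""
    return {
        "image": image_latent,
        "text": weighted_text,
        "intensity": max(1, min(10, intensity)),
    }
```

For example, scoring the prompt "a dog running fast" would assign most of the weight to "running" and "fast", and a user-chosen intensity would then scale how strongly the motion module animates that action.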
An interesting approach: separating spatial content from movable objects. Rendering spatial content and predicting object motion with separate modules is indeed a better and more efficient way to perform text-to-video generation.