Paper page - CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
AI-generated summary
CogVideoX is a large-scale diffusion transformer model that uses a 3D Variational Autoencoder and an expert transformer to generate high-quality, coherent videos from text prompts.
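As a rough illustration of why compressing along both spatial and temporal dimensions matters, the following sketch computes the reduction in latent size for assumed downsampling factors. The frame count and the 8x spatial / 4x temporal factors are illustrative assumptions, not values stated in this abstract.

```python
# Illustrative only: compression factors below are assumptions, not the paper's values.
frames, height, width = 48, 480, 720   # assumed input clip dimensions
s, t = 8, 4                            # assumed spatial and temporal downsampling factors

raw = frames * height * width                          # pixel positions before the VAE
latent = (frames // t) * (height // s) * (width // s)  # latent positions after the VAE

ratio = raw // latent
print(ratio)  # 256, i.e. s * s * t: each latent element covers 256 pixel positions
```

The point of the arithmetic: a diffusion transformer's attention cost grows with sequence length, so an order-of-magnitude shorter latent sequence is what makes modeling long clips tractable.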
We introduce CogVideoX, a large-scale diffusion transformer model designed for generating videos from text prompts. To model video data efficiently, we propose leveraging a 3D Variational Autoencoder (VAE) to compress videos along both spatial and temporal dimensions. To improve text-video alignment, we propose an expert transformer with expert adaptive LayerNorm to facilitate deep fusion between the two modalities. By employing a progressive training technique, CogVideoX is adept at producing coherent, long-duration videos characterized by significant motion. In addition, we develop an effective text-video data processing pipeline that includes various data preprocessing strategies and a video captioning method, which significantly enhances the performance of CogVideoX, improving both generation quality and semantic alignment. Results show that CogVideoX achieves state-of-the-art performance across multiple machine metrics and human evaluations. The model weights of both the 3D Causal VAE and CogVideoX are publicly available at https://github.com/THUDM/CogVideo.
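To make the "expert adaptive LayerNorm" idea concrete, here is a minimal PyTorch sketch of one plausible reading: text and video tokens share a LayerNorm but receive separate ("expert") scale/shift modulations predicted from a conditioning vector such as the timestep embedding. The class name, dimensions, and layout (text tokens first) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of expert adaptive LayerNorm: per-modality modulation
# of a shared, affine-free LayerNorm. Not the official CogVideoX code.
import torch
import torch.nn as nn


class ExpertAdaLayerNorm(nn.Module):
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # Separate "expert" modulation heads for text and video tokens,
        # each predicting a (scale, shift) pair from the conditioning vector.
        self.text_mod = nn.Linear(cond_dim, 2 * dim)
        self.video_mod = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor, n_text: int) -> torch.Tensor:
        # x: (batch, seq, dim) with the first n_text tokens being text;
        # cond: (batch, cond_dim), e.g. a timestep embedding.
        t_scale, t_shift = self.text_mod(cond).chunk(2, dim=-1)
        v_scale, v_shift = self.video_mod(cond).chunk(2, dim=-1)
        h = self.norm(x)
        text = h[:, :n_text] * (1 + t_scale[:, None]) + t_shift[:, None]
        video = h[:, n_text:] * (1 + v_scale[:, None]) + v_shift[:, None]
        return torch.cat([text, video], dim=1)


norm = ExpertAdaLayerNorm(dim=64, cond_dim=32)
out = norm(torch.randn(2, 10, 64), torch.randn(2, 32), n_text=4)
print(out.shape)  # torch.Size([2, 10, 64])
```

The design intuition matches the abstract's claim: text and video statistics differ, so giving each modality its own modulation parameters lets both coexist in one fused token sequence, which is what "deep fusion between the two modalities" refers to.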
Question from the community (@tengjiayan, @keg-yzy, @zwd125): Please ask whether they, or someone else, can make it available in Open WebUI via ComfyUI, Auto1111, or other methods, so we can run it locally on our machines.