Paper page - Matryoshka Diffusion Models

arxiv:2310.15111

Matryoshka Diffusion Models

Published on Oct 23, 2023
Submitted by AK on Oct 24, 2023
#1 Paper of the day
Authors: Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Josh Susskind, Navdeep Jaitly

Abstract

Matryoshka Diffusion Models use a NestedUNet architecture for joint denoising at multiple resolutions, enabling efficient high-resolution image and video synthesis.

AI-generated summary

Diffusion models are the de facto approach for generating high-quality images and videos, but learning high-dimensional models remains a formidable task due to computational and optimization challenges. Existing methods often resort to training cascaded models in pixel space or using a downsampled latent space of a separately trained auto-encoder. In this paper, we introduce Matryoshka Diffusion Models (MDM), an end-to-end framework for high-resolution image and video synthesis. We propose a diffusion process that denoises inputs at multiple resolutions jointly and uses a NestedUNet architecture where features and parameters for small-scale inputs are nested within those of large scales. In addition, MDM enables a progressive training schedule from lower to higher resolutions, which leads to significant improvements in optimization for high-resolution generation. We demonstrate the effectiveness of our approach on various benchmarks, including class-conditioned image generation, high-resolution text-to-image, and text-to-video applications. Remarkably, we can train a single pixel-space model at resolutions of up to 1024x1024 pixels, demonstrating strong zero-shot generalization using the CC12M dataset, which contains only 12 million images.
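To build intuition for the "joint denoising at multiple resolutions" idea, here is a minimal toy sketch (not the authors' implementation): a single noisy sample is exposed at nested resolutions via average pooling, which is the kind of multi-scale input a NestedUNet would denoise jointly. The function names and the choice of box-filter downsampling are illustrative assumptions.

```python
import numpy as np

def downsample(x, factor):
    # Average-pool a square image by `factor` (simple box filter).
    h, w = x.shape
    return x.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def matryoshka_noisy_views(x, sigma, rng):
    """Corrupt an image once, then expose it at nested resolutions.

    Sketches the paper's core input construction: the same noisy
    sample is viewed at full, half, and quarter resolution, so a
    nested model can denoise all scales jointly.
    """
    noisy = x + sigma * rng.standard_normal(x.shape)
    return [downsample(noisy, f) for f in (1, 2, 4)]

rng = np.random.default_rng(0)
image = rng.random((16, 16))
views = matryoshka_noisy_views(image, sigma=0.1, rng=rng)
print([v.shape for v in views])  # [(16, 16), (8, 8), (4, 4)]
```

Because all views derive from one noisy sample, a loss summed over the three resolutions trains coarse and fine scales together, which is also what makes a low-to-high progressive schedule natural.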

Community

The model's grasp of form and structure seems remarkably strong for being trained on such a small dataset. It's on par with, if not better than, SDXL in that regard! I imagine this has partially to do with the T5 encoder, but the architecture and progressive training certainly make a big difference.

I feel like if we combined this paper's architectural/training advancements with DALLE 3's strategy of training on highly detailed machine-generated captions, and scaled all of this up to something like LAION-2B, it could result in a very strong model.


Next-Level Image and Video Generation: Matryoshka Diffusion Models!

Links 🔗:

👉 Subscribe: https://www.youtube.com/@Arxflix
👉 Twitter: https://x.com/arxflix
👉 LMNT (Partner): https://lmnt.com/

By Arxflix


Models citing this paper 4

Datasets citing this paper 3

Spaces citing this paper 1

Collections including this paper 16