Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders
\n","updatedAt":"2026-01-24T01:38:38.126Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6926066279411316},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2601.16208","authors":[{"_id":"6972ea8bfb12c92b735b74a8","name":"Shengbang Tong","hidden":false},{"_id":"6972ea8bfb12c92b735b74a9","name":"Boyang Zheng","hidden":false},{"_id":"6972ea8bfb12c92b735b74aa","user":{"_id":"64249f76d476e4ad55665d59","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64249f76d476e4ad55665d59/msLU1k5dcdRuf2FwaWvCR.jpeg","isPro":false,"fullname":"Ziteng Wang","user":"AustinWang0330","type":"user"},"name":"Ziteng Wang","status":"claimed_verified","statusLastChangedAt":"2026-01-23T09:38:01.136Z","hidden":false},{"_id":"6972ea8bfb12c92b735b74ab","name":"Bingda Tang","hidden":false},{"_id":"6972ea8bfb12c92b735b74ac","name":"Nanye Ma","hidden":false},{"_id":"6972ea8bfb12c92b735b74ad","user":{"_id":"626dc5105f7327906f0b2a4e","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/626dc5105f7327906f0b2a4e/QCSzuwYqsv8ozRnusVb-F.jpeg","isPro":true,"fullname":"Ellis Brown","user":"ellisbrown","type":"user"},"name":"Ellis Brown","status":"claimed_verified","statusLastChangedAt":"2026-01-23T20:15:10.637Z","hidden":false},{"_id":"6972ea8bfb12c92b735b74ae","user":{"_id":"6304baf041387c7f1177a5d2","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6304baf041387c7f1177a5d2/cQgCR8AsrMUaF2QVh97I9.jpeg","isPro":true,"fullname":"Jihan Yang","user":"jihanyang","type":"user"},"name":"Jihan Yang","status":"claimed_verified","statusLastChangedAt":"2026-01-26T08:32:46.014Z","hidden":false},{"_id":"6972ea8bfb12c92b735b74af","name":"Rob Fergus","hidden":false},{"_id":"6972ea8bfb12c92b735b74b0","name":"Yann LeCun","hidden":false},{"_id":"6972ea8bfb12c92b735b74b1","name":"Saining Xie","hidden":false}],"publishedAt":"2026-01-22T18:58:16.000Z","submittedOnDailyAt":"2026-01-23T00:57:17.761Z","title":"Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders","submittedOnDailyBy":{"_id":"6434226da4c9c55871a78052","avatarUrl":"/avatars/3309832b3115bc6ad08ae1d10f43118b.svg","isPro":false,"fullname":"BoYang Zheng","user":"bytetriper","type":"user"},"summary":"Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet by training in high-dimensional semantic latent spaces. In this work, we investigate whether this framework can scale to large-scale, freeform text-to-image (T2I) generation. We first scale RAE decoders on the frozen representation encoder (SigLIP-2) beyond ImageNet by training on web, synthetic, and text-rendering data, finding that while scale improves general fidelity, targeted data composition is essential for specific domains like text. We then rigorously stress-test the RAE design choices originally proposed for ImageNet. 
Our analysis reveals that scaling simplifies the framework: while dimension-dependent noise scheduling remains critical, architectural complexities such as wide diffusion heads and noise-augmented decoding offer negligible benefits at scale Building on this simplified framework, we conduct a controlled comparison of RAE against the state-of-the-art FLUX VAE across diffusion transformer scales from 0.5B to 9.8B parameters. RAEs consistently outperform VAEs during pretraining across all model scales. Further, during finetuning on high-quality datasets, VAE-based models catastrophically overfit after 64 epochs, while RAE models remain stable through 256 epochs and achieve consistently better performance. Across all experiments, RAE-based diffusion models demonstrate faster convergence and better generation quality, establishing RAEs as a simpler and stronger foundation than VAEs for large-scale T2I generation. Additionally, because both visual understanding and generation can operate in a shared representation space, the multimodal model can directly reason over generated latents, opening new possibilities for unified models.","upvotes":52,"discussionId":"6972ea8cfb12c92b735b74b2","projectPage":"https://rae-dit.github.io/scale-rae/","githubRepo":"https://github.com/ZitengWangNYU/Scale-RAE","githubRepoAddedBy":"user","ai_summary":"Representation Autoencoders (RAEs) demonstrate superior performance over VAEs in large-scale text-to-image generation, showing improved stability, faster convergence, and better quality while enabling unified multimodal reasoning in shared representation spaces.","ai_keywords":["representation autoencoders","diffusion modeling","semantic latent spaces","text-to-image generation","frozen representation encoder","SigLIP-2","noise scheduling","diffusion transformers","pretraining","finetuning","catastrophic overfitting","multimodal model","shared representation space"],"githubStars":208,"organization":{"_id":"662741612ada5b77e310d171","name":"nyu-visionx","fullname":"VISIONx @ NYU","avatar":"https://cdn-uploads.huggingface.co/production/uploads/626dc5105f7327906f0b2a4e/Kn-QtZjE6TJE-syTndXIW.jpeg"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6434226da4c9c55871a78052","avatarUrl":"/avatars/3309832b3115bc6ad08ae1d10f43118b.svg","isPro":false,"fullname":"BoYang Zheng","user":"bytetriper","type":"user"},{"_id":"661b9ac57cfb7bcb3057a578","avatarUrl":"/avatars/f8afaa8eaad3a1e5963a4feebec3f7ab.svg","isPro":false,"fullname":"Yanheng He","user":"henryhe0123","type":"user"},{"_id":"6304baf041387c7f1177a5d2","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6304baf041387c7f1177a5d2/cQgCR8AsrMUaF2QVh97I9.jpeg","isPro":true,"fullname":"Jihan Yang","user":"jihanyang","type":"user"},{"_id":"654a6c59f8ebcec54510ee56","avatarUrl":"/avatars/31089658df2d754ab6e4f6ed2750cc1e.svg","isPro":false,"fullname":"Anjali W Gupta","user":"anjaliwgupta","type":"user"},{"_id":"6374cbb7255276f3a22b4b35","avatarUrl":"/avatars/7cf1bbb83447441e5fa2e1e4fcf7617b.svg","isPro":true,"fullname":"Peter Tong","user":"tsbpp","type":"user"},{"_id":"627ccf058b4e56cfc2716425","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1652346592327-noauth.jpeg","isPro":false,"fullname":"Shusheng Yang","user":"ShushengYang","type":"user"},{"_id":"64c2c45ae818eec6128fdda3","avatarUrl":"/avatars/d4399e25e6399345e263c7902789047e.svg","isPro":false,"fullname":"Junwan 
Kim","user":"junwann","type":"user"},{"_id":"60cc389a0844fb1605fef405","avatarUrl":"/avatars/ec11f85735e0525439e8821cf6d12e53.svg","isPro":false,"fullname":"Jaskirat Singh","user":"jsingh","type":"user"},{"_id":"6596422646624a86ff3b3bda","avatarUrl":"/avatars/216e12b77e45ac5f1fa20932f5745411.svg","isPro":false,"fullname":"Saining Xie","user":"sainx","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"63f233820a16587ea967adc2","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63f233820a16587ea967adc2/1nSoZofPV7UseXzjI2qAH.png","isPro":false,"fullname":"Sihan XU","user":"sihanxu","type":"user"},{"_id":"63f1d16fbe95ed4c9a9418fe","avatarUrl":"/avatars/a1bdfa97323693808f2f16ec74698ed3.svg","isPro":false,"fullname":"Yang Yue","user":"yueyang2000","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0,"organization":{"_id":"662741612ada5b77e310d171","name":"nyu-visionx","fullname":"VISIONx @ NYU","avatar":"https://cdn-uploads.huggingface.co/production/uploads/626dc5105f7327906f0b2a4e/Kn-QtZjE6TJE-syTndXIW.jpeg"}}">
AI-generated summary
Representation Autoencoders (RAEs) demonstrate superior performance over VAEs in large-scale text-to-image generation, showing improved stability, faster convergence, and better quality while enabling unified multimodal reasoning in shared representation spaces.
Abstract
Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet by training in high-dimensional semantic latent spaces. In this work, we investigate whether this framework can scale to large-scale, freeform text-to-image (T2I) generation. We first scale RAE decoders on the frozen representation encoder (SigLIP-2) beyond ImageNet by training on web, synthetic, and text-rendering data, finding that while scale improves general fidelity, targeted data composition is essential for specific domains like text. We then rigorously stress-test the RAE design choices originally proposed for ImageNet. Our analysis reveals that scaling simplifies the framework: while dimension-dependent noise scheduling remains critical, architectural complexities such as wide diffusion heads and noise-augmented decoding offer negligible benefits at scale. Building on this simplified framework, we conduct a controlled comparison of RAE against the state-of-the-art FLUX VAE across diffusion transformer scales from 0.5B to 9.8B parameters. RAEs consistently outperform VAEs during pretraining across all model scales. Further, during finetuning on high-quality datasets, VAE-based models catastrophically overfit after 64 epochs, while RAE models remain stable through 256 epochs and achieve consistently better performance. Across all experiments, RAE-based diffusion models demonstrate faster convergence and better generation quality, establishing RAEs as a simpler and stronger foundation than VAEs for large-scale T2I generation. Additionally, because both visual understanding and generation can operate in a shared representation space, the multimodal model can directly reason over generated latents, opening new possibilities for unified models.
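The abstract singles out dimension-dependent noise scheduling as the one ImageNet-era design choice that stays critical at scale. As a rough illustration only (not code from the paper or its repository), the sketch below applies a standard resolution/dimension-dependent timestep shift of the kind used in flow-matching diffusion training; the shift formula, the dimension values, and the function name are illustrative assumptions rather than the authors' settings.

```python
import torch


def shift_timesteps(t: torch.Tensor, latent_dim: int, base_dim: int) -> torch.Tensor:
    """Remap timesteps t in [0, 1] for a latent space whose effective dimension
    differs from a reference one, using the shift
        t' = alpha * t / (1 + (alpha - 1) * t),  alpha = sqrt(latent_dim / base_dim).
    Larger alpha pushes timesteps toward 1; whether that end corresponds to "more
    noise" depends on the trainer's schedule convention (an assumption here).
    """
    alpha = (latent_dim / base_dim) ** 0.5
    return alpha * t / (1 + (alpha - 1) * t)


# Illustrative effective dimensions (assumptions, not the paper's configuration):
# an RAE-style latent grid of 16x16 tokens with 768 channels vs. a 32x32x4 VAE latent.
t = torch.rand(8)  # uniform timesteps for one training batch
t_rae = shift_timesteps(t, latent_dim=16 * 16 * 768, base_dim=32 * 32 * 4)
print(t_rae)
```

Under these assumed dimensions the shift factor is roughly sqrt(48) ≈ 6.9, so the schedule is biased strongly toward one end of the timestep range, which is the qualitative effect a higher-dimensional semantic latent space calls for.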