Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders
\n","updatedAt":"2026-01-24T01:38:38.126Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6926066279411316},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2601.16208","authors":[{"_id":"6972ea8bfb12c92b735b74a8","name":"Shengbang Tong","hidden":false},{"_id":"6972ea8bfb12c92b735b74a9","name":"Boyang Zheng","hidden":false},{"_id":"6972ea8bfb12c92b735b74aa","user":{"_id":"64249f76d476e4ad55665d59","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64249f76d476e4ad55665d59/msLU1k5dcdRuf2FwaWvCR.jpeg","isPro":false,"fullname":"Ziteng Wang","user":"AustinWang0330","type":"user"},"name":"Ziteng Wang","status":"claimed_verified","statusLastChangedAt":"2026-01-23T09:38:01.136Z","hidden":false},{"_id":"6972ea8bfb12c92b735b74ab","name":"Bingda Tang","hidden":false},{"_id":"6972ea8bfb12c92b735b74ac","name":"Nanye Ma","hidden":false},{"_id":"6972ea8bfb12c92b735b74ad","user":{"_id":"626dc5105f7327906f0b2a4e","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/626dc5105f7327906f0b2a4e/QCSzuwYqsv8ozRnusVb-F.jpeg","isPro":true,"fullname":"Ellis Brown","user":"ellisbrown","type":"user"},"name":"Ellis Brown","status":"claimed_verified","statusLastChangedAt":"2026-01-23T20:15:10.637Z","hidden":false},{"_id":"6972ea8bfb12c92b735b74ae","user":{"_id":"6304baf041387c7f1177a5d2","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6304baf041387c7f1177a5d2/cQgCR8AsrMUaF2QVh97I9.jpeg","isPro":true,"fullname":"Jihan Yang","user":"jihanyang","type":"user"},"name":"Jihan Yang","status":"claimed_verified","statusLastChangedAt":"2026-01-26T08:32:46.014Z","hidden":false},{"_id":"6972ea8bfb12c92b735b74af","name":"Rob Fergus","hidden":false},{"_id":"6972ea8bfb12c92b735b74b0","name":"Yann LeCun","hidden":false},{"_id":"6972ea8bfb12c92b735b74b1","name":"Saining Xie","hidden":false}],"publishedAt":"2026-01-22T18:58:16.000Z","submittedOnDailyAt":"2026-01-23T00:57:17.761Z","title":"Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders","submittedOnDailyBy":{"_id":"6434226da4c9c55871a78052","avatarUrl":"/avatars/3309832b3115bc6ad08ae1d10f43118b.svg","isPro":false,"fullname":"BoYang Zheng","user":"bytetriper","type":"user"},"summary":"Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet by training in high-dimensional semantic latent spaces. In this work, we investigate whether this framework can scale to large-scale, freeform text-to-image (T2I) generation. We first scale RAE decoders on the frozen representation encoder (SigLIP-2) beyond ImageNet by training on web, synthetic, and text-rendering data, finding that while scale improves general fidelity, targeted data composition is essential for specific domains like text. We then rigorously stress-test the RAE design choices originally proposed for ImageNet. 
Our analysis reveals that scaling simplifies the framework: while dimension-dependent noise scheduling remains critical, architectural complexities such as wide diffusion heads and noise-augmented decoding offer negligible benefits at scale Building on this simplified framework, we conduct a controlled comparison of RAE against the state-of-the-art FLUX VAE across diffusion transformer scales from 0.5B to 9.8B parameters. RAEs consistently outperform VAEs during pretraining across all model scales. Further, during finetuning on high-quality datasets, VAE-based models catastrophically overfit after 64 epochs, while RAE models remain stable through 256 epochs and achieve consistently better performance. Across all experiments, RAE-based diffusion models demonstrate faster convergence and better generation quality, establishing RAEs as a simpler and stronger foundation than VAEs for large-scale T2I generation. Additionally, because both visual understanding and generation can operate in a shared representation space, the multimodal model can directly reason over generated latents, opening new possibilities for unified models.","upvotes":52,"discussionId":"6972ea8cfb12c92b735b74b2","projectPage":"https://rae-dit.github.io/scale-rae/","githubRepo":"https://github.com/ZitengWangNYU/Scale-RAE","githubRepoAddedBy":"user","ai_summary":"Representation Autoencoders (RAEs) demonstrate superior performance over VAEs in large-scale text-to-image generation, showing improved stability, faster convergence, and better quality while enabling unified multimodal reasoning in shared representation spaces.","ai_keywords":["representation autoencoders","diffusion modeling","semantic latent spaces","text-to-image generation","frozen representation encoder","SigLIP-2","noise scheduling","diffusion transformers","pretraining","finetuning","catastrophic overfitting","multimodal model","shared representation space"],"githubStars":208,"organization":{"_id":"662741612ada5b77e310d171","name":"nyu-visionx","fullname":"VISIONx @ NYU","avatar":"https://cdn-uploads.huggingface.co/production/uploads/626dc5105f7327906f0b2a4e/Kn-QtZjE6TJE-syTndXIW.jpeg"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6434226da4c9c55871a78052","avatarUrl":"/avatars/3309832b3115bc6ad08ae1d10f43118b.svg","isPro":false,"fullname":"BoYang Zheng","user":"bytetriper","type":"user"},{"_id":"661b9ac57cfb7bcb3057a578","avatarUrl":"/avatars/f8afaa8eaad3a1e5963a4feebec3f7ab.svg","isPro":false,"fullname":"Yanheng He","user":"henryhe0123","type":"user"},{"_id":"6304baf041387c7f1177a5d2","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6304baf041387c7f1177a5d2/cQgCR8AsrMUaF2QVh97I9.jpeg","isPro":true,"fullname":"Jihan Yang","user":"jihanyang","type":"user"},{"_id":"654a6c59f8ebcec54510ee56","avatarUrl":"/avatars/31089658df2d754ab6e4f6ed2750cc1e.svg","isPro":false,"fullname":"Anjali W Gupta","user":"anjaliwgupta","type":"user"},{"_id":"6374cbb7255276f3a22b4b35","avatarUrl":"/avatars/7cf1bbb83447441e5fa2e1e4fcf7617b.svg","isPro":true,"fullname":"Peter Tong","user":"tsbpp","type":"user"},{"_id":"627ccf058b4e56cfc2716425","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1652346592327-noauth.jpeg","isPro":false,"fullname":"Shusheng Yang","user":"ShushengYang","type":"user"},{"_id":"64c2c45ae818eec6128fdda3","avatarUrl":"/avatars/d4399e25e6399345e263c7902789047e.svg","isPro":false,"fullname":"Junwan 
Kim","user":"junwann","type":"user"},{"_id":"60cc389a0844fb1605fef405","avatarUrl":"/avatars/ec11f85735e0525439e8821cf6d12e53.svg","isPro":false,"fullname":"Jaskirat Singh","user":"jsingh","type":"user"},{"_id":"6596422646624a86ff3b3bda","avatarUrl":"/avatars/216e12b77e45ac5f1fa20932f5745411.svg","isPro":false,"fullname":"Saining Xie","user":"sainx","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"63f233820a16587ea967adc2","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63f233820a16587ea967adc2/1nSoZofPV7UseXzjI2qAH.png","isPro":false,"fullname":"Sihan XU","user":"sihanxu","type":"user"},{"_id":"63f1d16fbe95ed4c9a9418fe","avatarUrl":"/avatars/a1bdfa97323693808f2f16ec74698ed3.svg","isPro":false,"fullname":"Yang Yue","user":"yueyang2000","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0,"organization":{"_id":"662741612ada5b77e310d171","name":"nyu-visionx","fullname":"VISIONx @ NYU","avatar":"https://cdn-uploads.huggingface.co/production/uploads/626dc5105f7327906f0b2a4e/Kn-QtZjE6TJE-syTndXIW.jpeg"}}">
AI-generated summary
Representation Autoencoders (RAEs) demonstrate superior performance over VAEs in large-scale text-to-image generation, showing improved stability, faster convergence, and better quality while enabling unified multimodal reasoning in shared representation spaces.
Abstract
Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet by training in high-dimensional semantic latent spaces. In this work, we investigate whether this framework can scale to large-scale, freeform text-to-image (T2I) generation. We first scale RAE decoders on the frozen representation encoder (SigLIP-2) beyond ImageNet by training on web, synthetic, and text-rendering data, finding that while scale improves general fidelity, targeted data composition is essential for specific domains like text. We then rigorously stress-test the RAE design choices originally proposed for ImageNet. Our analysis reveals that scaling simplifies the framework: while dimension-dependent noise scheduling remains critical, architectural complexities such as wide diffusion heads and noise-augmented decoding offer negligible benefits at scale. Building on this simplified framework, we conduct a controlled comparison of RAE against the state-of-the-art FLUX VAE across diffusion transformer scales from 0.5B to 9.8B parameters. RAEs consistently outperform VAEs during pretraining across all model scales. Further, during finetuning on high-quality datasets, VAE-based models catastrophically overfit after 64 epochs, while RAE models remain stable through 256 epochs and achieve consistently better performance. Across all experiments, RAE-based diffusion models demonstrate faster convergence and better generation quality, establishing RAEs as a simpler and stronger foundation than VAEs for large-scale T2I generation. Additionally, because both visual understanding and generation can operate in a shared representation space, the multimodal model can directly reason over generated latents, opening new possibilities for unified models.
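The abstract singles out dimension-dependent noise scheduling as the one ImageNet-era design choice that stays critical at scale. As a rough illustration only (not code from the paper or its repository), the sketch below applies a standard resolution/dimension-dependent timestep shift of the kind used in flow-matching diffusion training; the shift formula, the dimension values, and the function name are illustrative assumptions rather than the authors' settings.

```python
import torch


def shift_timesteps(t: torch.Tensor, latent_dim: int, base_dim: int) -> torch.Tensor:
    """Remap timesteps t in [0, 1] for a latent space whose effective dimension
    differs from a reference one, using the shift
        t' = alpha * t / (1 + (alpha - 1) * t),  alpha = sqrt(latent_dim / base_dim).
    Larger alpha pushes timesteps toward 1; whether that end corresponds to "more
    noise" depends on the trainer's schedule convention (an assumption here).
    """
    alpha = (latent_dim / base_dim) ** 0.5
    return alpha * t / (1 + (alpha - 1) * t)


# Illustrative effective dimensions (assumptions, not the paper's configuration):
# an RAE-style latent grid of 16x16 tokens with 768 channels vs. a 32x32x4 VAE latent.
t = torch.rand(8)  # uniform timesteps for one training batch
t_rae = shift_timesteps(t, latent_dim=16 * 16 * 768, base_dim=32 * 32 * 4)
print(t_rae)
```

Under these assumed dimensions the shift factor is roughly sqrt(48) ≈ 6.9, so the schedule is biased strongly toward one end of the timestep range, which is the qualitative effect a higher-dimensional semantic latent space calls for.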