\n","updatedAt":"2025-05-17T01:36:28.653Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7118456959724426},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2505.10046","authors":[{"_id":"6826b00ed4c8b864e5ed1c0f","user":{"_id":"666de415ea11ab8f28560962","avatarUrl":"/avatars/c910b35e3128cf17ba1e45afff551820.svg","isPro":false,"fullname":"Bingda Tang","user":"ooutlierr","type":"user"},"name":"Bingda Tang","status":"admin_assigned","statusLastChangedAt":"2025-05-16T14:21:21.171Z","hidden":false},{"_id":"6826b00ed4c8b864e5ed1c10","user":{"_id":"6434226da4c9c55871a78052","avatarUrl":"/avatars/3309832b3115bc6ad08ae1d10f43118b.svg","isPro":false,"fullname":"BoYang Zheng","user":"bytetriper","type":"user"},"name":"Boyang Zheng","status":"admin_assigned","statusLastChangedAt":"2025-05-16T14:21:27.353Z","hidden":false},{"_id":"6826b00ed4c8b864e5ed1c11","user":{"_id":"63172831c92fd6fee3181f50","avatarUrl":"/avatars/0f57068a138cb181e9451bfc1ed3d1c0.svg","isPro":true,"fullname":"Xichen Pan","user":"xcpan","type":"user"},"name":"Xichen Pan","status":"admin_assigned","statusLastChangedAt":"2025-05-16T14:21:34.655Z","hidden":false},{"_id":"6826b00ed4c8b864e5ed1c12","user":{"_id":"5f7fbd813e94f16a85448745","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1649681653581-5f7fbd813e94f16a85448745.jpeg","isPro":true,"fullname":"Sayak Paul","user":"sayakpaul","type":"user"},"name":"Sayak Paul","status":"claimed_verified","statusLastChangedAt":"2025-05-16T13:23:23.708Z","hidden":false},{"_id":"6826b00ed4c8b864e5ed1c13","user":{"_id":"6596422646624a86ff3b3bda","avatarUrl":"/avatars/216e12b77e45ac5f1fa20932f5745411.svg","isPro":false,"fullname":"Saining Xie","user":"sainx","type":"user"},"name":"Saining Xie","status":"admin_assigned","statusLastChangedAt":"2025-05-16T14:21:41.504Z","hidden":false}],"publishedAt":"2025-05-15T07:43:23.000Z","submittedOnDailyAt":"2025-05-16T01:55:37.654Z","title":"Exploring the Deep Fusion of Large Language Models and Diffusion\n Transformers for Text-to-Image Synthesis","submittedOnDailyBy":{"_id":"5f7fbd813e94f16a85448745","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1649681653581-5f7fbd813e94f16a85448745.jpeg","isPro":true,"fullname":"Sayak Paul","user":"sayakpaul","type":"user"},"summary":"This paper does not describe a new method; instead, it provides a thorough\nexploration of an important yet understudied design space related to recent\nadvances in text-to-image synthesis -- specifically, the deep fusion of large\nlanguage models (LLMs) and diffusion transformers (DiTs) for multi-modal\ngeneration. Previous studies mainly focused on overall system performance\nrather than detailed comparisons with alternative methods, and key design\ndetails and training recipes were often left undisclosed. These gaps create\nuncertainty about the real potential of this approach. 
To fill these gaps, we\nconduct an empirical study on text-to-image generation, performing controlled\ncomparisons with established baselines, analyzing important design choices, and\nproviding a clear, reproducible recipe for training at scale. We hope this work\noffers meaningful data points and practical guidelines for future research in\nmulti-modal generation.","upvotes":9,"discussionId":"6826b00ed4c8b864e5ed1c4f","githubRepo":"https://github.com/tang-bd/fuse-dit","githubRepoAddedBy":"user","ai_summary":"Empirical exploration of text-to-image synthesis focuses on the deep fusion of large language models and diffusion transformers, providing reproducible guidelines and insights.","ai_keywords":["large language models","diffusion transformers","multi-modal generation","empirical study","controlled comparisons","baseline","design choices","reproducible recipe","training at scale"],"githubStars":131},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"61af81009f77f7b669578f95","avatarUrl":"/avatars/fb50773ac49948940eb231834ee6f2fd.svg","isPro":false,"fullname":"rotem israeli","user":"irotem98","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"648eb1eb59c4e5c87dc116e0","avatarUrl":"/avatars/c636cea39c2c0937f01398c94ead5dad.svg","isPro":false,"fullname":"fdsqefsgergd","user":"T-representer","type":"user"},{"_id":"64513261938967fd069d2340","avatarUrl":"/avatars/e4c3c435f6a4cda57d0e2f16ec1cda6e.svg","isPro":false,"fullname":"sdtana","user":"sdtana","type":"user"},{"_id":"61cd4b833dd34ba1985e0753","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/61cd4b833dd34ba1985e0753/BfHfrwotoMESpXZOHiIe4.png","isPro":false,"fullname":"KABI","user":"dongguanting","type":"user"},{"_id":"6581f9514adaee05cf640f81","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6581f9514adaee05cf640f81/sXvEEraq2QlSIyWHlSmpa.jpeg","isPro":false,"fullname":"Xi","user":"xi0v","type":"user"},{"_id":"64df3ad6a9bcacc18bc0606a","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/s3kpJyOf7NwO-tHEpRcok.png","isPro":false,"fullname":"Carlos","user":"Carlosvirella100","type":"user"},{"_id":"68291f60311678c965ac471f","avatarUrl":"/avatars/cd353bfa24806682298e8c2859db4d0b.svg","isPro":false,"fullname":"DENEN JAPHET","user":"lelejaphet2005","type":"user"},{"_id":"67eb5b4cd521f6bb19642605","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/67eb5b4cd521f6bb19642605/rQCHlFIsfHrkNvj3e9p_o.png","isPro":false,"fullname":"claudelee123","user":"claudeli1234","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
Exploring the Deep Fusion of Large Language Models and Diffusion
Transformers for Text-to-Image Synthesis
Published on May 15, 2025
Abstract
This paper does not describe a new method; instead, it provides a thorough exploration of an important yet understudied design space related to recent advances in text-to-image synthesis -- specifically, the deep fusion of large language models (LLMs) and diffusion transformers (DiTs) for multi-modal generation. Previous studies mainly focused on overall system performance rather than detailed comparisons with alternative methods, and key design details and training recipes were often left undisclosed. These gaps create uncertainty about the real potential of this approach. To fill these gaps, we conduct an empirical study on text-to-image generation, performing controlled comparisons with established baselines, analyzing important design choices, and providing a clear, reproducible recipe for training at scale. We hope this work offers meaningful data points and practical guidelines for future research in multi-modal generation.
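To make the design space concrete, below is a minimal, hypothetical sketch of the deep-fusion idea the abstract refers to: a single transformer block that runs joint self-attention over the concatenation of LLM text hidden states and image latent tokens, so the image stream can read the language model's representations at every depth. All class names, dimensions, and details are illustrative assumptions, not the paper's actual architecture; see the linked repository for the real implementation.

```python
# Hypothetical sketch of a "deep fusion" transformer block: LLM text hidden
# states and image latent tokens are concatenated into one sequence and
# processed with shared self-attention. Dimensions and names are assumptions.
import torch
import torch.nn as nn


class DeepFusionBlock(nn.Module):
    def __init__(self, dim: int = 1024, n_heads: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, img_tokens: torch.Tensor, txt_hidden: torch.Tensor):
        # Joint sequence: [text tokens | image tokens], shared attention
        x = torch.cat([txt_hidden, img_tokens], dim=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        # Split the streams back apart for the next layer
        n_txt = txt_hidden.shape[1]
        return x[:, n_txt:], x[:, :n_txt]


# Example: batch of 2, 77 text tokens from a frozen LLM, 64 image latents
txt = torch.randn(2, 77, 1024)
img = torch.randn(2, 64, 1024)
img_out, txt_out = DeepFusionBlock()(img, txt)
print(img_out.shape, txt_out.shape)  # (2, 64, 1024) and (2, 77, 1024)
```

The design choice illustrated here is that fusion happens inside every block rather than once at the input (shallow conditioning via cross-attention on a fixed text embedding), which is the contrast the paper's controlled comparisons are concerned with.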