Paper page - I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models


\n","updatedAt":"2025-02-18T12:40:20.657Z","author":{"_id":"6354bda206d707b33249c4c2","avatarUrl":"/avatars/bbd9f76274ac52214df92084d50bc7b5.svg","fullname":"Zhenxing Mi","name":"Mifucius","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":4,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.4943580627441406},"editors":["Mifucius"],"editorAvatarUrls":["/avatars/bbd9f76274ac52214df92084d50bc7b5.svg"],"reactions":[],"isReport":false}},{"id":"67b5357919b0fa6959a056e4","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2025-02-19T01:35:53.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [Hierarchical Vision-Language Alignment for Text-to-Image Generation via Diffusion Models](https://huggingface.co/papers/2501.00917) (2025)\n* [Vision-Driven Prompt Optimization for Large Language Models in Multimodal Generative Tasks](https://huggingface.co/papers/2501.02527) (2025)\n* [AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding](https://huggingface.co/papers/2502.01341) (2025)\n* [Decoder-Only LLMs are Better Controllers for Diffusion Models](https://huggingface.co/papers/2502.04412) (2025)\n* [EliGen: Entity-Level Controlled Image Generation with Regional Attention](https://huggingface.co/papers/2501.01097) (2025)\n* [Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens](https://huggingface.co/papers/2501.07730) (2025)\n* [ImageRef-VL: Enabling Contextual Image Referencing in Vision-Language Models](https://huggingface.co/papers/2501.12418) (2025)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

\n

The following papers were recommended by the Semantic Scholar API

\n\n

Please give a thumbs up to this comment if you found it helpful!

\n

If you want recommendations for any Paper on Hugging Face checkout this Space

\n

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend

\n","updatedAt":"2025-02-19T01:35:53.363Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6885784864425659},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2502.10458","authors":[{"_id":"67b3ea0f4dd7ea0538ce589d","user":{"_id":"6354bda206d707b33249c4c2","avatarUrl":"/avatars/bbd9f76274ac52214df92084d50bc7b5.svg","isPro":false,"fullname":"Zhenxing Mi","user":"Mifucius","type":"user"},"name":"Zhenxing Mi","status":"claimed_verified","statusLastChangedAt":"2025-02-18T09:31:52.837Z","hidden":false},{"_id":"67b3ea0f4dd7ea0538ce589e","user":{"_id":"648ca58a39d2584ee47efef6","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/648ca58a39d2584ee47efef6/R7B72bnwc59mdK45rmzYS.png","isPro":false,"fullname":"Kuan-Chieh Wang","user":"wangkua1","type":"user"},"name":"Kuan-Chieh Wang","status":"admin_assigned","statusLastChangedAt":"2025-02-19T15:21:46.349Z","hidden":false},{"_id":"67b3ea0f4dd7ea0538ce589f","user":{"_id":"645fed74335c21d19f3bf76c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/645fed74335c21d19f3bf76c/gwVsllRWtSHbg4a1erkdF.jpeg","isPro":false,"fullname":"Guocheng Gordon Qian","user":"guochengqian","type":"user"},"name":"Guocheng Qian","status":"admin_assigned","statusLastChangedAt":"2025-02-19T15:21:52.861Z","hidden":false},{"_id":"67b3ea0f4dd7ea0538ce58a0","user":{"_id":"62d3ae4d894e7fe42def988f","avatarUrl":"/avatars/3aafc55d9783459f9a79546fc31dd68a.svg","isPro":false,"fullname":"Hanrong Ye","user":"leoye","type":"user"},"name":"Hanrong Ye","status":"admin_assigned","statusLastChangedAt":"2025-02-19T15:21:58.865Z","hidden":false},{"_id":"67b3ea0f4dd7ea0538ce58a1","user":{"_id":"64a653e330dd11336539c439","avatarUrl":"/avatars/348910ea160829707ac5e74f9f824c60.svg","isPro":false,"fullname":"liuruntao","user":"runtao","type":"user"},"name":"Runtao Liu","status":"admin_assigned","statusLastChangedAt":"2025-02-19T15:22:08.197Z","hidden":false},{"_id":"67b3ea0f4dd7ea0538ce58a2","name":"Sergey Tulyakov","hidden":false},{"_id":"67b3ea0f4dd7ea0538ce58a3","user":{"_id":"64db29097266618e853dd6ec","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64db29097266618e853dd6ec/r0MaPQCfAxeKv3ycdKYLK.jpeg","isPro":false,"fullname":"Kfir Aberman","user":"kaberman","type":"user"},"name":"Kfir Aberman","status":"admin_assigned","statusLastChangedAt":"2025-02-19T15:22:17.444Z","hidden":false},{"_id":"67b3ea0f4dd7ea0538ce58a4","user":{"_id":"66feab48651e00e22f33222e","avatarUrl":"/avatars/7344377e2c796c7ec85194bb2fc78521.svg","isPro":false,"fullname":"Dan Xu","user":"danxuhk","type":"user"},"name":"Dan Xu","status":"claimed_verified","statusLastChangedAt":"2025-02-19T09:04:53.095Z","hidden":false}],"publishedAt":"2025-02-12T05:30:08.000Z","submittedOnDailyAt":"2025-02-18T07:03:41.120Z","title":"I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning\n in Diffusion 
Models","submittedOnDailyBy":{"_id":"6354bda206d707b33249c4c2","avatarUrl":"/avatars/bbd9f76274ac52214df92084d50bc7b5.svg","isPro":false,"fullname":"Zhenxing Mi","user":"Mifucius","type":"user"},"summary":"This paper presents ThinkDiff, a novel alignment paradigm that empowers\ntext-to-image diffusion models with multimodal in-context understanding and\nreasoning capabilities by integrating the strengths of vision-language models\n(VLMs). Existing multimodal diffusion finetuning methods largely focus on\npixel-level reconstruction rather than in-context reasoning, and are\nconstrained by the complexity and limited availability of reasoning-based\ndatasets. ThinkDiff addresses these challenges by leveraging vision-language\ntraining as a proxy task, aligning VLMs with the decoder of an encoder-decoder\nlarge language model (LLM) instead of a diffusion decoder. This proxy task\nbuilds on the observation that the LLM decoder shares the same input\nfeature space with diffusion decoders that use the corresponding\nLLM encoder for prompt embedding. As a result, aligning VLMs with\ndiffusion decoders can be simplified through alignment with the LLM decoder.\nWithout complex training and datasets, ThinkDiff effectively unleashes\nunderstanding, reasoning, and composing capabilities in diffusion models.\nExperiments demonstrate that ThinkDiff significantly improves accuracy from\n19.2% to 46.3% on the challenging CoBSAT benchmark for multimodal in-context\nreasoning generation, with only 5 hours of training on 4 A100 GPUs.\nAdditionally, ThinkDiff demonstrates exceptional performance in composing\nmultiple images and texts into logically coherent images. Project page:\nhttps://mizhenxing.github.io/ThinkDiff.","upvotes":38,"discussionId":"67b3ea124dd7ea0538ce592d","projectPage":"https://mizhenxing.github.io/ThinkDiff","githubRepo":"https://github.com/MiZhenxing/ThinkDiff","githubRepoAddedBy":"user","ai_summary":"ThinkDiff improves multimodal reasoning in text-to-image diffusion models using vision-language models aligned with the decoder of an encoder-decoder large language model.","ai_keywords":["vision-language models","multimodal diffusion","in-context reasoning","encoder-decoder.large language model","diffusion decoders","LLM decoder","CoBSAT benchmark","logically coherent images"],"githubStars":193},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6354bda206d707b33249c4c2","avatarUrl":"/avatars/bbd9f76274ac52214df92084d50bc7b5.svg","isPro":false,"fullname":"Zhenxing Mi","user":"Mifucius","type":"user"},{"_id":"647d893bcfca67bc50f8974f","avatarUrl":"/avatars/fab275f968b7d648dbdf8485c9279fb6.svg","isPro":false,"fullname":"Yang Cao","user":"YangCaoCS","type":"user"},{"_id":"64c21a7f576884e0fac15aee","avatarUrl":"/avatars/fa3a8bbc4abc0136d5f9c17de1ddf8da.svg","isPro":false,"fullname":"yueqi Xie","user":"xyq7","type":"user"},{"_id":"6594d390674349122ce6f368","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6594d390674349122ce6f368/KdWz6lZyGYQpjAgBDeiC1.jpeg","isPro":false,"fullname":"Zedong Wang","user":"JackyWangAI","type":"user"},{"_id":"648eb1eb59c4e5c87dc116e0","avatarUrl":"/avatars/c636cea39c2c0937f01398c94ead5dad.svg","isPro":false,"fullname":"fdsqefsgergd","user":"T-representer","type":"user"},{"_id":"63044075d14428368d1c4c6c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63044075d14428368d1c4c6c/P2WhQMib5VXyT0w-Sjf2V.jpeg","isPro":false,"fullname":"Khoi 
Nguyen","user":"ducminhkhoi","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"6581f9514adaee05cf640f81","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6581f9514adaee05cf640f81/sXvEEraq2QlSIyWHlSmpa.jpeg","isPro":false,"fullname":"Xi","user":"xi0v","type":"user"},{"_id":"6270324ebecab9e2dcf245de","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6270324ebecab9e2dcf245de/cMbtWSasyNlYc9hvsEEzt.jpeg","isPro":false,"fullname":"Kye Gomez","user":"kye","type":"user"},{"_id":"644b78959e85a62bf07655f2","avatarUrl":"/avatars/518660b7743715af57629e863a038165.svg","isPro":false,"fullname":"Dmitri Iourovitski","user":"IoDmitri","type":"user"},{"_id":"652b83b73b5997ed71a310f2","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/652b83b73b5997ed71a310f2/ipCpdeHUp4-0OmRz5z8IW.png","isPro":false,"fullname":"Rui Zhao","user":"ruizhaocv","type":"user"},{"_id":"668cd4bbe990292e5f6974d3","avatarUrl":"/avatars/d1747b2372e94500ecb5fb56809b482d.svg","isPro":false,"fullname":"Jinyeong Kim","user":"rubatoyeong","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">

I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models

Published on Feb 12, 2025 · Submitted by Zhenxing Mi on Feb 18, 2025

Abstract

AI-generated summary

ThinkDiff improves multimodal reasoning in text-to-image diffusion models using vision-language models aligned with the decoder of an encoder-decoder large language model.

This paper presents ThinkDiff, a novel alignment paradigm that empowers text-to-image diffusion models with multimodal in-context understanding and reasoning capabilities by integrating the strengths of vision-language models (VLMs). Existing multimodal diffusion finetuning methods largely focus on pixel-level reconstruction rather than in-context reasoning, and are constrained by the complexity and limited availability of reasoning-based datasets. ThinkDiff addresses these challenges by leveraging vision-language training as a proxy task, aligning VLMs with the decoder of an encoder-decoder large language model (LLM) instead of a diffusion decoder. This proxy task builds on the observation that the LLM decoder shares the same input feature space with diffusion decoders that use the corresponding LLM encoder for prompt embedding. As a result, aligning VLMs with diffusion decoders can be simplified through alignment with the LLM decoder. Without complex training and datasets, ThinkDiff effectively unleashes understanding, reasoning, and composing capabilities in diffusion models. Experiments demonstrate that ThinkDiff significantly improves accuracy from 19.2% to 46.3% on the challenging CoBSAT benchmark for multimodal in-context reasoning generation, with only 5 hours of training on 4 A100 GPUs. Additionally, ThinkDiff demonstrates exceptional performance in composing multiple images and texts into logically coherent images. Project page: https://mizhenxing.github.io/ThinkDiff.
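To make the alignment idea above concrete, here is a minimal PyTorch-style sketch (not the authors' released implementation). The `Aligner` module, the stand-in frozen decoder, and all dimensions are hypothetical placeholders; the sketch only illustrates the proxy task described in the abstract: training a lightweight aligner against a frozen language-model decoder with an ordinary captioning loss, then reusing the aligned features as prompt embeddings for a diffusion decoder that shares the same input feature space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Aligner(nn.Module):
    """Trainable network mapping VLM token features into the prompt-embedding
    space shared by the encoder-decoder LLM's decoder and the diffusion decoder."""

    def __init__(self, vlm_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vlm_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vlm_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(vlm_feats)


# Hypothetical sizes; in ThinkDiff the VLM, LLM decoder, and diffusion decoder
# are large frozen pretrained networks, replaced here by small stand-ins.
vlm_dim, llm_dim, vocab_size = 1024, 512, 32000
batch, n_vlm_tokens, n_text_tokens = 2, 77, 32

aligner = Aligner(vlm_dim, llm_dim)

# Stand-in for the frozen LLM decoder: a cross-attending transformer decoder
# plus token embedding and output head, all with gradients disabled.
llm_decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
    num_layers=2,
)
token_embedding = nn.Embedding(vocab_size, llm_dim)
lm_head = nn.Linear(llm_dim, vocab_size)
for module in (llm_decoder, token_embedding, lm_head):
    for p in module.parameters():
        p.requires_grad_(False)

# Proxy task (training): condition the frozen decoder on the aligned VLM
# features and train only the aligner with a standard next-token loss.
vlm_features = torch.randn(batch, n_vlm_tokens, vlm_dim)          # from the frozen VLM
caption_ids = torch.randint(0, vocab_size, (batch, n_text_tokens))

aligned = aligner(vlm_features)                                   # (B, 77, llm_dim)
tgt_in = token_embedding(caption_ids[:, :-1])
causal_mask = nn.Transformer.generate_square_subsequent_mask(tgt_in.shape[1])
decoder_out = llm_decoder(tgt_in, memory=aligned, tgt_mask=causal_mask)
logits = lm_head(decoder_out)                                     # (B, 31, vocab)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), caption_ids[:, 1:].reshape(-1))
loss.backward()  # gradients flow through the frozen modules but update only the aligner

# Inference: because the diffusion decoder reads the same feature space as the
# LLM decoder, the aligned VLM features can be passed to it as prompt
# embeddings in place of the usual LLM-encoder output.
prompt_embeddings = aligner(vlm_features).detach()
print(prompt_embeddings.shape)  # torch.Size([2, 77, 512])
```

The point the sketch tries to capture is that the language-modeling loss only updates the aligner, so swapping the frozen LLM decoder for the diffusion decoder at inference requires no further training, which is consistent with the lightweight training budget reported in the abstract.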

Community

Paper author Paper submitter

Project page: https://mizhenxing.github.io/ThinkDiff
Code to be released at: https://github.com/MiZhenxing/ThinkDiff
arXiv: https://arxiv.org/abs/2502.10458
Hugging Face paper page: https://huggingface.co/papers/2502.10458

Librarian Bot (Bot)

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* Hierarchical Vision-Language Alignment for Text-to-Image Generation via Diffusion Models (https://huggingface.co/papers/2501.00917) (2025)
* Vision-Driven Prompt Optimization for Large Language Models in Multimodal Generative Tasks (https://huggingface.co/papers/2501.02527) (2025)
* AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding (https://huggingface.co/papers/2502.01341) (2025)
* Decoder-Only LLMs are Better Controllers for Diffusion Models (https://huggingface.co/papers/2502.04412) (2025)
* EliGen: Entity-Level Controlled Image Generation with Regional Attention (https://huggingface.co/papers/2501.01097) (2025)
* Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens (https://huggingface.co/papers/2501.07730) (2025)
* ImageRef-VL: Enabling Contextual Image Referencing in Vision-Language Models (https://huggingface.co/papers/2501.12418) (2025)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2502.10458 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2502.10458 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2502.10458 in a Space README.md to link it from this page.

Collections including this paper 16