Paper page - Diffusion Feedback Helps CLIP See Better
https://github.com/baaivision/DIVA
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* VIMI: Grounding Video Generation through Multi-modal Instruction (2024) - https://huggingface.co/papers/2407.06304
* Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation (2024) - https://huggingface.co/papers/2406.00670
* VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval (2024) - https://huggingface.co/papers/2406.04292
* MATE: Meet At The Embedding -- Connecting Images with Long Texts (2024) - https://huggingface.co/papers/2407.09541
* X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs (2024) - https://huggingface.co/papers/2407.13851

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
\n","updatedAt":"2024-07-31T01:25:22.462Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.709954559803009},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2407.20171","authors":[{"_id":"66a85e795f395fc0c6c7afac","user":{"_id":"656428b5462e5ebcbf537d4e","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/656428b5462e5ebcbf537d4e/WkX8Z_OJPOGAcuKq3aPXL.jpeg","isPro":false,"fullname":"Wenxuan Wang","user":"Rookielion","type":"user"},"name":"Wenxuan Wang","status":"claimed_verified","statusLastChangedAt":"2024-07-30T09:51:07.296Z","hidden":false},{"_id":"66a85e795f395fc0c6c7afad","user":{"_id":"630d7a8f81ef9b1772b67f4c","avatarUrl":"/avatars/00757abd6e548ccebb5bfb233be129a2.svg","isPro":false,"fullname":"Quan Sun","user":"QuanSun","type":"user"},"name":"Quan Sun","status":"admin_assigned","statusLastChangedAt":"2024-07-30T08:42:05.741Z","hidden":false},{"_id":"66a85e795f395fc0c6c7afae","user":{"_id":"640ed40dc025ddf618950af7","avatarUrl":"/avatars/a99abd4346fa649ad4144f284ebcc972.svg","isPro":false,"fullname":"Fan Zhang","user":"ryanzhangfan","type":"user"},"name":"Fan Zhang","status":"claimed_verified","statusLastChangedAt":"2024-10-08T07:22:37.841Z","hidden":false},{"_id":"66a85e795f395fc0c6c7afaf","user":{"_id":"64dee07b3d3a7519f18ebbe2","avatarUrl":"/avatars/8f113938715b094435e014509630aa68.svg","isPro":false,"fullname":"tang","user":"yepeng98","type":"user"},"name":"Yepeng Tang","status":"admin_assigned","statusLastChangedAt":"2024-07-30T08:42:34.745Z","hidden":false},{"_id":"66a85e795f395fc0c6c7afb0","name":"Jing Liu","hidden":false},{"_id":"66a85e795f395fc0c6c7afb1","user":{"_id":"63ca558304c979828311c5a5","avatarUrl":"/avatars/2a439d79fba2f987cabe780d10c94d25.svg","isPro":false,"fullname":"Xinlong Wang","user":"xinlongwang","type":"user"},"name":"Xinlong Wang","status":"admin_assigned","statusLastChangedAt":"2024-07-30T08:43:46.103Z","hidden":false}],"publishedAt":"2024-07-29T17:00:09.000Z","submittedOnDailyAt":"2024-07-30T02:01:11.387Z","title":"Diffusion Feedback Helps CLIP See Better","submittedOnDailyBy":{"_id":"60f1abe7544c2adfd699860c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674929746905-60f1abe7544c2adfd699860c.jpeg","isPro":false,"fullname":"AK","user":"akhaliq","type":"user"},"summary":"Contrastive Language-Image Pre-training (CLIP), which excels at abstracting\nopen-world representations across domains and modalities, has become a\nfoundation for a variety of vision and multimodal tasks. However, recent\nstudies reveal that CLIP has severe visual shortcomings, such as which can\nhardly distinguish orientation, quantity, color, structure, etc. These visual\nshortcomings also limit the perception capabilities of multimodal large\nlanguage models (MLLMs) built on CLIP. The main reason could be that the\nimage-text pairs used to train CLIP are inherently biased, due to the lack of\nthe distinctiveness of the text and the diversity of images. 
In this work, we\npresent a simple post-training approach for CLIP models, which largely\novercomes its visual shortcomings via a self-supervised diffusion process. We\nintroduce DIVA, which uses the DIffusion model as a Visual Assistant for CLIP.\nSpecifically, DIVA leverages generative feedback from text-to-image diffusion\nmodels to optimize CLIP representations, with only images (without\ncorresponding text). We demonstrate that DIVA improves CLIP's performance on\nthe challenging MMVP-VLM benchmark which assesses fine-grained visual abilities\nto a large extent (e.g., 3-7%), and enhances the performance of MLLMs and\nvision models on multimodal understanding and segmentation tasks. Extensive\nevaluation on 29 image classification and retrieval benchmarks confirms that\nour framework preserves CLIP's strong zero-shot capabilities. The code will be\navailable at https://github.com/baaivision/DIVA.","upvotes":36,"discussionId":"66a85e7a5f395fc0c6c7b013","githubRepo":"https://github.com/baaivision/diva","githubRepoAddedBy":"auto","ai_summary":"DIVA enhances CLIP's performance through a self-supervised diffusion process, improving visual capabilities and multimodal understanding without additional text labels.","ai_keywords":["Contrastive Language-Image Pre-training","CLIP","visual shortcomings","multimodal large language models","MLLMs","image-text pairs","self-supervised diffusion","DIffusion model","Visual Assistant","generative feedback","text-to-image diffusion","MMVP-VLM benchmark","image classification","image retrieval","zero-shot capabilities"],"githubStars":299},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"61af81009f77f7b669578f95","avatarUrl":"/avatars/fb50773ac49948940eb231834ee6f2fd.svg","isPro":false,"fullname":"rotem israeli","user":"irotem98","type":"user"},{"_id":"65ba471ad88a65abb9328ee2","avatarUrl":"/avatars/956238ce5034091e64d026b0272c4400.svg","isPro":false,"fullname":"Dazhi Jiang","user":"thuzhizhi","type":"user"},{"_id":"63ca558304c979828311c5a5","avatarUrl":"/avatars/2a439d79fba2f987cabe780d10c94d25.svg","isPro":false,"fullname":"Xinlong Wang","user":"xinlongwang","type":"user"},{"_id":"656428b5462e5ebcbf537d4e","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/656428b5462e5ebcbf537d4e/WkX8Z_OJPOGAcuKq3aPXL.jpeg","isPro":false,"fullname":"Wenxuan Wang","user":"Rookielion","type":"user"},{"_id":"64f955c582673b2a07fbf0ad","avatarUrl":"/avatars/1c98c8be61f6580c1e4ee698fa5c0716.svg","isPro":false,"fullname":"hongyu","user":"learn12138","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"64dee07b3d3a7519f18ebbe2","avatarUrl":"/avatars/8f113938715b094435e014509630aa68.svg","isPro":false,"fullname":"tang","user":"yepeng98","type":"user"},{"_id":"648eb1eb59c4e5c87dc116e0","avatarUrl":"/avatars/c636cea39c2c0937f01398c94ead5dad.svg","isPro":false,"fullname":"fdsqefsgergd","user":"T-representer","type":"user"},{"_id":"630d7a8f81ef9b1772b67f4c","avatarUrl":"/avatars/00757abd6e548ccebb5bfb233be129a2.svg","isPro":false,"fullname":"Quan Sun","user":"QuanSun","type":"user"},{"_id":"646e86350867c99c2d3f2ecf","avatarUrl":"/avatars/b89798ff623abffb169eacda2ac32fde.svg","isPro":true,"fullname":"Han 
Lin","user":"hanlincs","type":"user"},{"_id":"66896c588b6217e04fbbfcca","avatarUrl":"/avatars/f1c1b1d37173cd599a0392729d0c6a60.svg","isPro":false,"fullname":"Deanna Powers","user":"deann33","type":"user"},{"_id":"66897e8457fd09c47d11dc28","avatarUrl":"/avatars/ca8fb71afe84c638f448d9c17f023e29.svg","isPro":false,"fullname":"Anton Brooks","user":"ANTONBROOKS","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
AI-generated summary

DIVA enhances CLIP's performance through a self-supervised diffusion process, improving visual capabilities and multimodal understanding without additional text labels.

Abstract
Contrastive Language-Image Pre-training (CLIP), which excels at abstracting
open-world representations across domains and modalities, has become a
foundation for a variety of vision and multimodal tasks. However, recent
studies reveal that CLIP has severe visual shortcomings; for example, it can hardly distinguish orientation, quantity, color, and structure. These visual
shortcomings also limit the perception capabilities of multimodal large
language models (MLLMs) built on CLIP. The main reason could be that the image-text pairs used to train CLIP are inherently biased, lacking both distinctiveness in the texts and diversity in the images. In this work, we
present a simple post-training approach for CLIP models that largely overcomes these visual shortcomings via a self-supervised diffusion process. We
introduce DIVA, which uses the DIffusion model as a Visual Assistant for CLIP.
Specifically, DIVA leverages generative feedback from text-to-image diffusion
models to optimize CLIP representations, using only images (without corresponding text). We demonstrate that DIVA substantially improves CLIP's performance (e.g., by 3-7%) on the challenging MMVP-VLM benchmark, which assesses fine-grained visual abilities, and enhances the performance of MLLMs and
vision models on multimodal understanding and segmentation tasks. Extensive
evaluation on 29 image classification and retrieval benchmarks confirms that
our framework preserves CLIP's strong zero-shot capabilities. The code will be
available at https://github.com/baaivision/DIVA.
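
The abstract only sketches how the generative feedback works. Below is a minimal, hypothetical PyTorch sketch of the idea as described above: a frozen text-to-image diffusion model denoises noised images while conditioned on CLIP's visual features, and the denoising loss is back-propagated only into the CLIP image encoder (plus a small conditioning projector). All module and function names here (`clip_visual`, `cond_proj`, `diffusion_unet`, `scheduler`, `diffusion_feedback_step`) are illustrative assumptions, not the authors' released API; see https://github.com/baaivision/DIVA for the actual implementation.

```python
# A minimal sketch (not the authors' implementation) of using diffusion
# denoising loss as "generative feedback" for a CLIP visual encoder.
# All modules below (clip_visual, cond_proj, diffusion_unet, scheduler)
# are hypothetical placeholders passed in by the caller.

import torch
import torch.nn.functional as F

def diffusion_feedback_step(images, clip_visual, cond_proj,
                            diffusion_unet, scheduler, optimizer):
    """One self-supervised step: only images are needed, no text."""
    # 1) Encode images with the trainable CLIP visual encoder and project the
    #    features into the diffusion model's conditioning space.
    visual_feats = clip_visual(images)            # e.g. [B, N, D] patch tokens
    cond = cond_proj(visual_feats)

    # 2) Standard forward diffusion: add noise at random timesteps.
    noise = torch.randn_like(images)
    t = torch.randint(0, scheduler.num_timesteps, (images.size(0),),
                      device=images.device)
    noisy_images = scheduler.add_noise(images, noise, t)

    # 3) The diffusion U-Net (its parameters kept frozen) predicts the noise,
    #    conditioned on the CLIP features; gradients flow back through `cond`
    #    into the CLIP encoder.
    pred_noise = diffusion_unet(noisy_images, t, cond)

    # 4) The denoising loss is the generative feedback; the optimizer holds
    #    only the CLIP encoder's (and the projector's) parameters.
    loss = F.mse_loss(pred_noise, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice the diffusion model may operate in a latent space and the conditioning may differ in detail (e.g., how CLIP features are combined with the diffusion model's usual text conditioning); those specifics are omitted from this sketch.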