arxiv:2504.16801

Decoupled Global-Local Alignment for Improving Compositional Understanding

Published on Apr 23, 2025 · Submitted by Kaicheng Yang on Apr 24, 2025
Authors: Xiaoxing Hu, Kaicheng Yang, Jun Wang, Haoran Xu, Ziyong Feng, Yupei Wang

AI-generated summary

The DeGLA framework enhances compositional understanding in vision-language models through decoupled global-local alignment, self-distillation, and image-text grounded contrastive losses, improving performance on benchmarks and zero-shot classification tasks.

Abstract

Contrastive Language-Image Pre-training (CLIP) has achieved success on multiple downstream tasks by aligning image and text modalities. However, the nature of global contrastive learning limits CLIP's ability to comprehend compositional concepts, such as relations and attributes. Although recent studies employ global hard negative samples to improve compositional understanding, these methods significantly compromise the model's inherent general capabilities by forcibly distancing textual negative samples from images in the embedding space. To overcome this limitation, we introduce a Decoupled Global-Local Alignment (DeGLA) framework that improves compositional understanding while substantially mitigating losses in general capabilities. To optimize the retention of the model's inherent capabilities, we incorporate a self-distillation mechanism within the global alignment process, aligning the learnable image-text encoder with a frozen teacher model derived from an exponential moving average. This constraint effectively mitigates the catastrophic forgetting of pretrained knowledge during fine-tuning. To improve compositional understanding, we first leverage the in-context learning capability of Large Language Models (LLMs) to construct about 2M high-quality negative captions across five types. Subsequently, we propose the Image-Grounded Contrast (IGC) loss and Text-Grounded Contrast (TGC) loss to enhance vision-language compositionality. Extensive experimental results demonstrate the effectiveness of the DeGLA framework. Compared to previous state-of-the-art methods, DeGLA achieves an average enhancement of 3.5% across the VALSE, SugarCrepe, and ARO benchmarks. Concurrently, it obtains an average performance improvement of 13.0% on zero-shot classification tasks across eleven datasets. Our code will be released at https://github.com/xiaoxing2001/DeGLA
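To make the method concrete, below is a minimal PyTorch sketch of the two mechanisms the abstract describes: the EMA teacher used for self-distillation during global alignment, and contrastive losses over LLM-generated hard-negative captions. This is an illustrative sketch, not the authors' released implementation (see the repository linked above); the function names, the exact loss forms, and the `momentum` and `temperature` values are assumptions.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, momentum: float = 0.999) -> None:
    """Update the frozen teacher as an exponential moving average of the student
    (momentum value is an assumption)."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(momentum).add_(s_p.detach(), alpha=1.0 - momentum)


def self_distillation_loss(student_img: torch.Tensor, student_txt: torch.Tensor,
                           teacher_img: torch.Tensor, teacher_txt: torch.Tensor) -> torch.Tensor:
    """Keep the learnable encoders close to the EMA teacher's embeddings to limit
    forgetting of pretrained knowledge (assumed cosine form)."""
    loss_i = 1.0 - F.cosine_similarity(student_img, teacher_img.detach(), dim=-1).mean()
    loss_t = 1.0 - F.cosine_similarity(student_txt, teacher_txt.detach(), dim=-1).mean()
    return loss_i + loss_t


def grounded_contrast_losses(img_emb: torch.Tensor, pos_txt_emb: torch.Tensor,
                             neg_txt_emb: torch.Tensor, temperature: float = 0.07):
    """Assumed forms of the IGC and TGC losses over a batch of
    (image, positive caption, hard-negative caption) triples, each of shape (B, D)."""
    img = F.normalize(img_emb, dim=-1)
    pos = F.normalize(pos_txt_emb, dim=-1)
    neg = F.normalize(neg_txt_emb, dim=-1)

    # IGC (assumption): image-grounded -- each image must score its positive
    # caption above its LLM-generated hard-negative caption.
    igc_logits = torch.stack([(img * pos).sum(-1), (img * neg).sum(-1)], dim=1) / temperature
    igc = F.cross_entropy(igc_logits, torch.zeros(img.size(0), dtype=torch.long, device=img.device))

    # TGC (assumption): text-grounded -- each positive caption must retrieve its
    # paired image against the other images in the batch.
    t2i_logits = (pos @ img.t()) / temperature
    tgc = F.cross_entropy(t2i_logits, torch.arange(img.size(0), device=img.device))

    return igc, tgc
```

In a training loop, such terms would be added to the standard image-text contrastive objective, with `ema_update` called after each optimizer step; the precise formulations and loss weights are those given in the paper and code, not this sketch.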

Community

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

- Enhancing Vision-Language Compositional Understanding with Multimodal Synthetic Data (2025) https://huggingface.co/papers/2503.01167
- GOAL: Global-local Object Alignment Learning (2025) https://huggingface.co/papers/2503.17782
- TULIP: Towards Unified Language-Image Pretraining (2025) https://huggingface.co/papers/2503.15485
- Post-pre-training for Modality Alignment in Vision-Language Foundation Models (2025) https://huggingface.co/papers/2504.12717
- Aligning Information Capacity Between Vision and Language via Dense-to-Sparse Feature Distillation for Image-Text Matching (2025) https://huggingface.co/papers/2503.14953
- VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models (2025) https://huggingface.co/papers/2504.03970
- Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection (2025) https://huggingface.co/papers/2503.17080

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2504.16801 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2504.16801 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2504.16801 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.