Paper page - ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder

arxiv:2510.18795

ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder

Published on Oct 21, 2025 · Submitted by Kaicheng Yang on Oct 22, 2025
Authors: Xiaoxing Hu, Kaicheng Yang, Ziyong Feng, Qi Ming, Zonghao Guo, Xiang An, Junchi Yan, Xue Yang

Abstract

ProCLIP enhances CLIP's text processing capabilities by aligning its image encoder with an LLM-based embedder through curriculum learning and contrastive tuning, preserving CLIP's pretrained knowledge.

AI-generated summary

The original CLIP text encoder is limited by a maximum input length of 77 tokens, which hampers its ability to process long texts effectively and to perform fine-grained semantic understanding. In addition, the CLIP text encoder lacks support for multilingual inputs. These limitations significantly restrict its applicability across a broader range of tasks. Recent studies have attempted to replace the CLIP text encoder with an LLM-based embedder to enhance its capabilities in long-text processing, multilingual understanding, and fine-grained semantic comprehension. However, because the representation spaces of LLMs and the vision-language space of CLIP are pretrained independently without alignment priors, direct alignment using contrastive learning can disrupt the intrinsic vision-language alignment in the CLIP image encoder, leading to an underutilization of the knowledge acquired during pre-training. To address this challenge, we propose ProCLIP, a curriculum learning-based progressive vision-language alignment framework to effectively align the CLIP image encoder with an LLM-based embedder. Specifically, ProCLIP first distills knowledge from CLIP's text encoder into the LLM-based embedder to leverage CLIP's rich pretrained knowledge while establishing initial alignment between the LLM embedder and the CLIP image encoder. Subsequently, ProCLIP further aligns the CLIP image encoder with the LLM-based embedder through image-text contrastive tuning, employing self-distillation regularization to avoid overfitting. To achieve a more effective alignment, an instance semantic alignment loss and an embedding structure alignment loss are employed during representation inheritance and contrastive tuning. The code is available at https://github.com/VisionXLab/ProCLIP
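
As a concrete illustration of the two-stage recipe described in the abstract, the sketch below shows what Stage 1 (distilling CLIP text embeddings into the LLM-based embedder) and Stage 2 (image-text contrastive tuning with self-distillation regularization) could look like in PyTorch. The module names (clip_text_encoder, clip_image_encoder, frozen_image_encoder, llm_embedder) and the specific loss forms (cosine distillation, symmetric InfoNCE) are assumptions made for illustration, not the authors' implementation; the official code is at https://github.com/VisionXLab/ProCLIP.

```python
# Minimal sketch, assuming hypothetical encoder callables that map a batch of
# captions/images to embedding tensors of shape (batch, dim).
import torch
import torch.nn.functional as F

def stage1_distillation_loss(llm_embedder, clip_text_encoder, captions):
    """Stage 1 (representation inheritance): pull the LLM-based embedder
    toward the frozen CLIP text encoder on the same captions."""
    with torch.no_grad():
        teacher = F.normalize(clip_text_encoder(captions), dim=-1)  # frozen teacher
    student = F.normalize(llm_embedder(captions), dim=-1)
    # Cosine-similarity distillation: 1 - cos(teacher, student), averaged over the batch.
    return (1.0 - (teacher * student).sum(dim=-1)).mean()

def stage2_contrastive_loss(clip_image_encoder, frozen_image_encoder,
                            llm_embedder, images, captions,
                            temperature=0.07, distill_weight=1.0):
    """Stage 2: image-text contrastive tuning, regularized by self-distillation
    against a frozen copy of the CLIP image encoder."""
    img = F.normalize(clip_image_encoder(images), dim=-1)
    txt = F.normalize(llm_embedder(captions), dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(img.size(0), device=img.device)
    # Symmetric InfoNCE over image-to-text and text-to-image directions.
    contrastive = 0.5 * (F.cross_entropy(logits, labels) +
                         F.cross_entropy(logits.t(), labels))
    with torch.no_grad():
        img_ref = F.normalize(frozen_image_encoder(images), dim=-1)
    # Self-distillation keeps the tuned image encoder close to its pretrained
    # state so the original vision-language alignment is not destroyed.
    self_distill = (1.0 - (img * img_ref).sum(dim=-1)).mean()
    return contrastive + distill_weight * self_distill
```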

Community

Paper author · Paper submitter

The original CLIP text encoder is limited by a maximum input length of 77 tokens, which hampers its ability to process long texts effectively and to perform fine-grained semantic understanding. In addition, the CLIP text encoder lacks support for multilingual inputs. These limitations significantly restrict its applicability across a broader range of tasks. Recent studies have attempted to replace the CLIP text encoder with an LLM-based embedder to enhance its capabilities in long-text processing, multilingual understanding, and fine-grained semantic comprehension. However, because the representation spaces of LLMs and the vision-language space of CLIP are pretrained independently without alignment priors, direct alignment using contrastive learning can disrupt the intrinsic vision-language alignment in the CLIP image encoder, leading to an underutilization of the knowledge acquired during pre-training. To address this challenge, we propose ProCLIP, a curriculum learning-based progressive vision-language alignment framework to effectively align the CLIP image encoder with an LLM-based embedder. Specifically, ProCLIP first distills knowledge from CLIP’s text encoder into the LLM-based embedder to leverage CLIP’s rich pretrained knowledge while establishing initial alignment between the LLM embedder and the CLIP image encoder. Subsequently, ProCLIP further aligns the CLIP image encoder with the LLM-based embedder through image-text contrastive tuning, employing self-distillation regularization to avoid overfitting. To achieve a more effective alignment, an instance semantic alignment loss and an embedding structure alignment loss are employed during representation inheritance and contrastive tuning. Extensive experiments show that ProCLIP achieves a 6.8% to 13.5% improvement on zero-shot classification and delivers excellent performance on cross-modal retrieval, multilingual cross-modal retrieval, and fine-grained understanding tasks, demonstrating the effectiveness and robustness of ProCLIP.
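
The comment above also names two auxiliary objectives, an instance semantic alignment loss and an embedding structure alignment loss. The sketch below shows one common way such losses are implemented (per-instance cosine alignment and pairwise-similarity matching); the exact formulations used by ProCLIP are defined in the paper and repository, so treat these function bodies as plausible stand-ins rather than the official definitions.

```python
# Hedged sketch of the two auxiliary alignment objectives; function names and
# forms are illustrative assumptions, not the authors' code.
import torch
import torch.nn.functional as F

def instance_semantic_alignment(student_emb, teacher_emb):
    """Per-instance alignment: each student embedding should point in the
    same direction as its teacher counterpart."""
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    return (1.0 - (s * t).sum(dim=-1)).mean()

def embedding_structure_alignment(student_emb, teacher_emb):
    """Structure-level alignment: the pairwise similarity matrix within the
    student batch should match the one within the teacher batch."""
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    return F.mse_loss(s @ s.t(), t @ t.t())
```

The structure-level term regularizes relations between samples rather than individual embeddings, which is one way to inherit the geometry of the teacher space while still letting individual embeddings move during contrastive tuning.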

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2510.18795 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2510.18795 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2510.18795 in a Space README.md to link it from this page.

Collections including this paper 1