Paper page - Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval

arxiv:2509.09118

Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval

Published on Sep 11, 2025 · Submitted by Kaicheng Yang on Sep 12, 2025
Authors: Tianlu Zheng, Yifan Zhang, Xiang An, Ziyong Feng, Kaicheng Yang, Qichuan Ding

Abstract

GA-DMS framework enhances CLIP for person representation learning by improving data quality and model architecture, achieving state-of-the-art performance.

AI-generated summary

Although Contrastive Language-Image Pre-training (CLIP) exhibits strong performance across diverse vision tasks, its application to person representation learning faces two critical challenges: (i) the scarcity of large-scale annotated vision-language data focused on person-centric images, and (ii) the inherent limitations of global contrastive learning, which struggles to maintain discriminative local features crucial for fine-grained matching while remaining vulnerable to noisy text tokens. This work advances CLIP for person representation learning through synergistic improvements in data curation and model architecture. First, we develop a noise-resistant data construction pipeline that leverages the in-context learning capabilities of MLLMs to automatically filter and caption web-sourced images. This yields WebPerson, a large-scale dataset of 5M high-quality person-centric image-text pairs. Second, we introduce the GA-DMS (Gradient-Attention Guided Dual-Masking Synergetic) framework, which improves cross-modal alignment by adaptively masking noisy textual tokens based on the gradient-attention similarity score. Additionally, we incorporate masked token prediction objectives that compel the model to predict informative text tokens, enhancing fine-grained semantic representation learning. Extensive experiments show that GA-DMS achieves state-of-the-art performance across multiple benchmarks.
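The abstract does not spell out how the gradient-attention similarity score is computed, so the following is only a minimal sketch of the general idea of score-guided token masking, not the authors' implementation. It assumes each text token already comes with a gradient-based saliency value and an attention weight, combines them with a simple product as a stand-in score, and masks the lowest-scoring fraction of tokens. The names `pick_noisy_tokens`, `apply_mask`, and `mask_ratio` are hypothetical and do not come from the paper or its repository.

```python
from typing import List, Sequence


def normalize(values: Sequence[float]) -> List[float]:
    """Scale non-negative values so they sum to 1 (uniform if all are zero)."""
    if not values:
        return []
    total = sum(values)
    if total == 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def pick_noisy_tokens(
    grad_saliency: Sequence[float],
    attention: Sequence[float],
    mask_ratio: float = 0.25,
) -> List[int]:
    """Return indices of the tokens with the lowest combined score.

    Assumption: the combined score is the product of normalized gradient
    magnitude and normalized attention weight. This is an illustrative
    stand-in for the paper's gradient-attention similarity score.
    """
    if len(grad_saliency) != len(attention):
        raise ValueError("grad_saliency and attention must have equal length")
    grad = normalize([abs(g) for g in grad_saliency])
    attn = normalize(attention)
    scores = [g * a for g, a in zip(grad, attn)]
    n_mask = int(len(scores) * mask_ratio)
    # Sort token positions by ascending score; the lowest are treated as noisy.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:n_mask])


def apply_mask(
    tokens: Sequence[str],
    noisy_idx: Sequence[int],
    mask_token: str = "[MASK]",
) -> List[str]:
    """Replace the tokens at the given positions with a mask token."""
    noisy = set(noisy_idx)
    return [mask_token if i in noisy else t for i, t in enumerate(tokens)]


# Toy caption with two web-scrape leftovers at the end (made-up values).
tokens = ["a", "woman", "in", "a", "red", "coat", "click", "here"]
grad_saliency = [0.1, 0.9, 0.1, 0.1, 0.8, 0.7, 0.05, 0.02]
attention = [0.05, 0.3, 0.05, 0.05, 0.25, 0.2, 0.06, 0.04]

noisy_idx = pick_noisy_tokens(grad_saliency, attention, mask_ratio=0.25)
masked_tokens = apply_mask(tokens, noisy_idx)
print(noisy_idx)      # [6, 7]
print(masked_tokens)  # the last two tokens become "[MASK]"
```

In this toy setup the trailing tokens "click" and "here" receive the lowest scores and are masked, while descriptive words such as "woman", "red", and "coat" are kept. The second masking stream described in the abstract, which asks the model to predict informative tokens, would presumably select from the high-scoring end instead; that part is not sketched here.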

Community

Paper author Paper submitter

Although Contrastive Language-Image Pretraining (CLIP) exhibits strong performance across diverse vision tasks, its application to person representation learning faces two critical challenges: (i) the scarcity of large-scale annotated vision-language data focused on person-centric images, and (ii) the inherent limitations of global contrastive learning, which struggles to maintain discriminative local features crucial for fine-grained matching while remaining vulnerable to noisy text tokens. This work advances CLIP for person representation learning through synergistic improvements in data curation and model architecture. First, we develop a noise-resistant data construction pipeline that leverages the in-context learning capabilities of MLLMs to automatically filter and caption web-sourced images. This yields WebPerson, a large-scale dataset of 5M high-quality person-centric image-text pairs. Second, we introduce the GA-DMS (Gradient-Attention Guided Dual-Masking Synergetic) framework, which improves cross-modal alignment by adaptively masking noisy textual tokens based on the gradient-attention similarity score. Additionally, we incorporate masked token prediction objectives that compel the model to predict informative text tokens, enhancing fine-grained semantic representation learning. Extensive experiments show that GA-DMS achieves state-of-the-art performance across multiple benchmarks.


