Paper page - Region-based Cluster Discrimination for Visual Representation Learning
\n","updatedAt":"2025-07-29T02:27:48.186Z","author":{"_id":"6478679d7b370854241b2ad8","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6478679d7b370854241b2ad8/dBczWYYdfEt9tQcnVGhQk.jpeg","fullname":"xiangan","name":"xiangan","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":15,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.4301263391971588},"editors":["xiangan"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/6478679d7b370854241b2ad8/dBczWYYdfEt9tQcnVGhQk.jpeg"],"reactions":[],"isReport":false}},{"id":"6889776770664164b087257b","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2025-07-30T01:37:43.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation](https://huggingface.co/papers/2506.16806) (2025)\n* [Hierarchical Cross-modal Prompt Learning for Vision-Language Models](https://huggingface.co/papers/2507.14976) (2025)\n* [Advancing Visual Large Language Model for Multi-granular Versatile Perception](https://huggingface.co/papers/2507.16213) (2025)\n* [HierVL: Semi-Supervised Segmentation leveraging Hierarchical Vision-Language Synergy with Dynamic Text-Spatial Query Alignment](https://huggingface.co/papers/2506.13925) (2025)\n* [Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better](https://huggingface.co/papers/2506.09040) (2025)\n* [HRSeg: High-Resolution Visual Perception and Enhancement for Reasoning Segmentation](https://huggingface.co/papers/2507.12883) (2025)\n* [Partial CLIP is Enough: Chimera-Seg for Zero-shot Semantic Segmentation](https://huggingface.co/papers/2506.22032) (2025)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
\n
The following papers were recommended by the Semantic Scholar API
Please give a thumbs up to this comment if you found it helpful!
\n
If you want recommendations for any Paper on Hugging Face checkout this Space
\n
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend
\n","updatedAt":"2025-07-30T01:37:43.507Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6891946792602539},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2507.20025","authors":[{"_id":"68883088af872d625c10c5eb","name":"Yin Xie","hidden":false},{"_id":"68883088af872d625c10c5ec","user":{"_id":"63e202f352b7578dba448ab5","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63e202f352b7578dba448ab5/8itVBLcv14m7OVsoF8h1o.jpeg","isPro":false,"fullname":"Kaicheng Yang","user":"Kaichengalex","type":"user"},"name":"Kaicheng Yang","status":"claimed_verified","statusLastChangedAt":"2025-07-29T07:08:26.917Z","hidden":false},{"_id":"68883088af872d625c10c5ed","user":{"_id":"6478679d7b370854241b2ad8","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6478679d7b370854241b2ad8/dBczWYYdfEt9tQcnVGhQk.jpeg","isPro":false,"fullname":"xiangan","user":"xiangan","type":"user"},"name":"Xiang An","status":"claimed_verified","statusLastChangedAt":"2025-07-29T07:08:29.232Z","hidden":false},{"_id":"68883088af872d625c10c5ee","user":{"_id":"64f597588b6d053c709debd9","avatarUrl":"/avatars/85d046a412c503f883e16324030a457b.svg","isPro":false,"fullname":"Kun","user":"Athinklo","type":"user"},"name":"Kun Wu","status":"claimed_verified","statusLastChangedAt":"2025-07-29T07:08:24.529Z","hidden":false},{"_id":"68883088af872d625c10c5ef","user":{"_id":"6433e9e4ea46c00990460ab0","avatarUrl":"/avatars/3a4bec74eeec94713e8eb5e290d00765.svg","isPro":false,"fullname":"yonglezhao","user":"yonglezhao","type":"user"},"name":"Yongle Zhao","status":"claimed_verified","statusLastChangedAt":"2025-09-05T14:17:13.525Z","hidden":false},{"_id":"68883088af872d625c10c5f0","name":"Weimo Deng","hidden":false},{"_id":"68883088af872d625c10c5f1","name":"Zimin Ran","hidden":false},{"_id":"68883088af872d625c10c5f2","name":"Yumeng Wang","hidden":false},{"_id":"68883088af872d625c10c5f3","name":"Ziyong Feng","hidden":false},{"_id":"68883088af872d625c10c5f4","user":{"_id":"6478cea13b7f8b1f624997e5","avatarUrl":"/avatars/34641d634da565cd50a7310de3f4513f.svg","isPro":false,"fullname":"Roy Miles","user":"iyop45","type":"user"},"name":"Roy Miles","status":"claimed_verified","statusLastChangedAt":"2025-09-12T16:22:55.805Z","hidden":false},{"_id":"68883088af872d625c10c5f5","name":"Ismail Elezi","hidden":false},{"_id":"68883088af872d625c10c5f6","name":"Jiankang Deng","hidden":false}],"publishedAt":"2025-07-26T17:47:09.000Z","submittedOnDailyAt":"2025-07-29T00:56:42.617Z","title":"Region-based Cluster Discrimination for Visual Representation Learning","submittedOnDailyBy":{"_id":"6478679d7b370854241b2ad8","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6478679d7b370854241b2ad8/dBczWYYdfEt9tQcnVGhQk.jpeg","isPro":false,"fullname":"xiangan","user":"xiangan","type":"user"},"summary":"Learning visual representations is foundational for a broad spectrum of\ndownstream tasks. 
Although recent vision-language contrastive models, such as\nCLIP and SigLIP, have achieved impressive zero-shot performance via large-scale\nvision-language alignment, their reliance on global representations constrains\ntheir effectiveness for dense prediction tasks, such as grounding, OCR, and\nsegmentation. To address this gap, we introduce Region-Aware Cluster\nDiscrimination (RICE), a novel method that enhances region-level visual and OCR\ncapabilities. We first construct a billion-scale candidate region dataset and\npropose a Region Transformer layer to extract rich regional semantics. We\nfurther design a unified region cluster discrimination loss that jointly\nsupports object and OCR learning within a single classification framework,\nenabling efficient and scalable distributed training on large-scale data.\nExtensive experiments show that RICE consistently outperforms previous methods\non tasks, including segmentation, dense detection, and visual perception for\nMultimodal Large Language Models (MLLMs). The pre-trained models have been\nreleased at https://github.com/deepglint/MVT.","upvotes":19,"discussionId":"68883088af872d625c10c5f7","githubRepo":"https://github.com/deepglint/MVT","githubRepoAddedBy":"user","ai_summary":"RICE, a novel method using a Region Transformer and region cluster discrimination loss, enhances region-level visual and OCR capabilities, outperforming previous methods in tasks like segmentation and dense detection.","ai_keywords":["Region-Aware Cluster Discrimination","RICE","Region Transformer","region cluster discrimination loss","dense prediction tasks","grounding","OCR","segmentation","dense detection","Multimodal Large Language Models","MLLMs"],"githubStars":66},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6478679d7b370854241b2ad8","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6478679d7b370854241b2ad8/dBczWYYdfEt9tQcnVGhQk.jpeg","isPro":false,"fullname":"xiangan","user":"xiangan","type":"user"},{"_id":"655c70d331c4978366d4b2e6","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/655c70d331c4978366d4b2e6/X-KjTNkxtzeYu9ngBOh_C.jpeg","isPro":false,"fullname":"yiyexy","user":"yiyexy","type":"user"},{"_id":"623d7b1c19b08016c234411d","avatarUrl":"/avatars/cbadc8e39e60ddd152c636c81fe2c409.svg","isPro":false,"fullname":"JunWang","user":"ZJUJunWang","type":"user"},{"_id":"67ff56f52c0d96649b1bbbe8","avatarUrl":"/avatars/a6c0e0413eed8847ef4bc73345356e6d.svg","isPro":false,"fullname":"Zhichao Chen","user":"GukehAn","type":"user"},{"_id":"63e202f352b7578dba448ab5","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63e202f352b7578dba448ab5/8itVBLcv14m7OVsoF8h1o.jpeg","isPro":false,"fullname":"Kaicheng Yang","user":"Kaichengalex","type":"user"},{"_id":"641030c77a15af878ae5bd8f","avatarUrl":"/avatars/8a5037edf55c78ebc317c8b191343671.svg","isPro":false,"fullname":"TianchengGu","user":"TianchengGu","type":"user"},{"_id":"673c09d251d8d86ed0e4b343","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/9p0jpsSw_HYINDHsKasDW.png","isPro":false,"fullname":"guo","user":"sigma28","type":"user"},{"_id":"67223563fa69c82e19d2232c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/1z_axjIty3uB4UDYa9JK4.png","isPro":false,"fullname":"Xiaoxing 
Hu","user":"wsdwJohn1231","type":"user"},{"_id":"64f597588b6d053c709debd9","avatarUrl":"/avatars/85d046a412c503f883e16324030a457b.svg","isPro":false,"fullname":"Kun","user":"Athinklo","type":"user"},{"_id":"67ebb4622dccb686675b1c4d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/6ZMZn-lw2rWcm8eUoEbpY.png","isPro":false,"fullname":"tangkaijie","user":"tukjet00","type":"user"},{"_id":"62cc7a38376917c0223dd24b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62cc7a38376917c0223dd24b/VytbnJrtkFy4-GBgWo8pP.jpeg","isPro":false,"fullname":"JiankangDeng","user":"JiankangDeng","type":"user"},{"_id":"68889d2869f8578a40cec8e6","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/wsORKt4bHfiKu6WnRWCIO.png","isPro":false,"fullname":"Tyreke McGlade","user":"tyrekeMG","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
AI-generated summary

RICE, a novel method using a Region Transformer and region cluster discrimination loss, enhances region-level visual and OCR capabilities, outperforming previous methods in tasks like segmentation and dense detection.

Abstract
Learning visual representations is foundational for a broad spectrum of
downstream tasks. Although recent vision-language contrastive models, such as
CLIP and SigLIP, have achieved impressive zero-shot performance via large-scale
vision-language alignment, their reliance on global representations constrains
their effectiveness for dense prediction tasks, such as grounding, OCR, and
segmentation. To address this gap, we introduce Region-Aware Cluster
Discrimination (RICE), a novel method that enhances region-level visual and OCR
capabilities. We first construct a billion-scale candidate region dataset and
propose a Region Transformer layer to extract rich regional semantics. We
further design a unified region cluster discrimination loss that jointly
supports object and OCR learning within a single classification framework,
enabling efficient and scalable distributed training on large-scale data.
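
For intuition, the sketch below shows one way such a unified cluster discrimination objective over regions could be implemented: each region embedding is classified against a single shared bank of learnable cluster centers covering both object and OCR clusters. The class name, the temperature value, and the assumption of pre-computed cluster assignments are illustrative choices, not the paper's exact formulation.

```python
# Minimal sketch of a region cluster discrimination loss (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionClusterDiscriminationLoss(nn.Module):
    def __init__(self, embed_dim: int, num_clusters: int, temperature: float = 0.05):
        super().__init__()
        # One shared classifier over object + OCR clusters (single classification framework).
        self.centers = nn.Parameter(torch.randn(num_clusters, embed_dim) * 0.02)
        self.temperature = temperature

    def forward(self, region_embeds: torch.Tensor, cluster_ids: torch.Tensor) -> torch.Tensor:
        # region_embeds: (num_regions, embed_dim); cluster_ids: (num_regions,) pre-computed assignments.
        region_embeds = F.normalize(region_embeds, dim=-1)
        centers = F.normalize(self.centers, dim=-1)
        logits = region_embeds @ centers.t() / self.temperature  # cosine-similarity logits
        return F.cross_entropy(logits, cluster_ids)

# Example: 32 region embeddings of dim 768 classified into 10k clusters.
loss_fn = RegionClusterDiscriminationLoss(embed_dim=768, num_clusters=10_000)
loss = loss_fn(torch.randn(32, 768), torch.randint(0, 10_000, (32,)))
```

In practice, scaling this to a billion-scale region dataset would also require sharding the cluster centers across devices, which is what makes a pure classification formulation attractive for distributed training.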
Extensive experiments show that RICE consistently outperforms previous methods
on a range of tasks, including segmentation, dense detection, and visual perception for
Multimodal Large Language Models (MLLMs). The pre-trained models have been
released at https://github.com/deepglint/MVT.
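
For readers who want a concrete picture of the region-level modeling, below is a minimal, assumption-laden sketch of a region-attention layer in the spirit of the proposed Region Transformer layer: candidate regions are mask-pooled from ViT patch tokens into queries and then refined by cross-attention over all patch tokens. All names, shapes, and design choices here are hypothetical and not taken from the released code.

```python
# Hypothetical region-attention layer: mask-pooled region queries cross-attend to patch tokens.
import torch
import torch.nn as nn

class RegionAttentionLayer(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, patch_tokens: torch.Tensor, region_masks: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, dim); region_masks: (B, R, N) with 1s on patches inside each candidate region.
        weights = region_masks / region_masks.sum(dim=-1, keepdim=True).clamp(min=1.0)
        region_queries = weights @ patch_tokens                      # (B, R, dim) mask-pooled region queries
        attn_out, _ = self.cross_attn(self.norm_q(region_queries),
                                      self.norm_kv(patch_tokens),
                                      self.norm_kv(patch_tokens))
        region_feats = region_queries + attn_out                     # residual cross-attention update
        return region_feats + self.ffn(region_feats)                 # (B, R, dim) region embeddings

# Example: 4 images, 196 patch tokens each, 8 candidate regions per image.
layer = RegionAttentionLayer()
feats = layer(torch.randn(4, 196, 768), torch.randint(0, 2, (4, 8, 196)).float())
```

The resulting per-region embeddings would then feed the cluster discrimination objective sketched above; again, this is only one plausible reading of the design, and the official implementation at https://github.com/deepglint/MVT is the authoritative reference.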