ModernVBERT: Towards Smaller Visual Document Retrievers
\n","updatedAt":"2025-10-04T01:37:54.955Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6963590383529663},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2510.01149","authors":[{"_id":"68de63af70ada21878c75026","user":{"_id":"6651baf4b34bbdaec88333e7","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6651baf4b34bbdaec88333e7/YTk8BiBAfgF0oDVOehULS.jpeg","isPro":false,"fullname":"Paul Teiletche","user":"paultltc","type":"user"},"name":"Paul Teiletche","status":"claimed_verified","statusLastChangedAt":"2025-12-25T20:55:13.002Z","hidden":false},{"_id":"68de63af70ada21878c75027","user":{"_id":"661e945eebe3616a1b09e279","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/661e945eebe3616a1b09e279/U3DL1BNouUpcusCKAPZm0.jpeg","isPro":false,"fullname":"Quentin Macé","user":"QuentinJG","type":"user"},"name":"Quentin Macé","status":"claimed_verified","statusLastChangedAt":"2025-10-03T12:52:04.668Z","hidden":false},{"_id":"68de63af70ada21878c75028","name":"Max Conti","hidden":false},{"_id":"68de63af70ada21878c75029","user":{"_id":"6377b63b24d97f9f7ec70064","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6377b63b24d97f9f7ec70064/vBl68PAwwp353MOeOahMH.jpeg","isPro":false,"fullname":"Antonio Loison","user":"antonioloison","type":"user"},"name":"Antonio Loison","status":"claimed_verified","statusLastChangedAt":"2025-10-11T14:05:49.301Z","hidden":false},{"_id":"68de63af70ada21878c7502a","name":"Gautier Viaud","hidden":false},{"_id":"68de63af70ada21878c7502b","name":"Pierre Colombo","hidden":false},{"_id":"68de63af70ada21878c7502c","user":{"_id":"60f2e021adf471cbdf8bb660","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1654090481550-60f2e021adf471cbdf8bb660.jpeg","isPro":false,"fullname":"Manuel Faysse","user":"manu","type":"user"},"name":"Manuel Faysse","status":"claimed_verified","statusLastChangedAt":"2025-10-03T12:52:47.596Z","hidden":false}],"publishedAt":"2025-10-01T17:41:17.000Z","submittedOnDailyAt":"2025-10-03T09:50:48.566Z","title":"ModernVBERT: Towards Smaller Visual Document Retrievers","submittedOnDailyBy":{"_id":"60f2e021adf471cbdf8bb660","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1654090481550-60f2e021adf471cbdf8bb660.jpeg","isPro":false,"fullname":"Manuel Faysse","user":"manu","type":"user"},"summary":"Multimodal embedding models are gaining prevalence, notably for document\nretrieval as efficient alternatives to text-only pipelines. These models are\ntypically built by finetuning large vision-language decoders (VLMs) with\ncontrastive losses on text-image pairs. In this work, we show that, while\ncost-efficient, this repurposing approach often bottlenecks retrieval\nperformance. Through controlled experiments, we establish a principled recipe\nfor improving visual document retrieval models. 
We notably measure the impact\nof attention masking, image resolution, modality alignment data regimes, and\nlate interaction centered contrastive objectives which emerge as central\nperformance factors. Building on these insights, we release ModernVBERT, a\ncompact 250M-parameter vision-language encoder that outperforms models up to 10\ntimes larger when finetuned on document retrieval tasks. Models and code are\nmade available at https://huggingface.co/ModernVBERT.","upvotes":32,"discussionId":"68de63b070ada21878c7502d","projectPage":"https://huggingface.co/ModernVBERT","githubRepo":"https://github.com/illuin-tech/modernvbert","githubRepoAddedBy":"auto","ai_summary":"ModernVBERT, a compact vision-language encoder, outperforms larger models in document retrieval by optimizing attention masking, image resolution, modality alignment, and contrastive objectives.","ai_keywords":["multimodal embedding models","document retrieval","vision-language decoders","contrastive losses","attention masking","image resolution","modality alignment","contrastive objectives","vision-language encoder"],"githubStars":12,"organization":{"_id":"68dc126476aff34f469efbc4","name":"ModernVBERT","fullname":"ModernVBERT","avatar":"https://cdn-uploads.huggingface.co/production/uploads/6651baf4b34bbdaec88333e7/qxlWj1d9iagGW6T8yCkJV.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"60f2e021adf471cbdf8bb660","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1654090481550-60f2e021adf471cbdf8bb660.jpeg","isPro":false,"fullname":"Manuel Faysse","user":"manu","type":"user"},{"_id":"661e945eebe3616a1b09e279","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/661e945eebe3616a1b09e279/U3DL1BNouUpcusCKAPZm0.jpeg","isPro":false,"fullname":"Quentin Macé","user":"QuentinJG","type":"user"},{"_id":"6720a87e392e9cea0187fde6","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6720a87e392e9cea0187fde6/vW8DW31UvdKD809UyYCS4.jpeg","isPro":false,"fullname":"Max Conti","user":"mlconti","type":"user"},{"_id":"618a65e17304dc918c6602ff","avatarUrl":"/avatars/8af1112094169da80c65e24ab71c7e59.svg","isPro":false,"fullname":"Gautier Viaud","user":"gautierviaud","type":"user"},{"_id":"64be3b2b805e5b64572eec44","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/VDCwEAgQ2G0fGCvB5L8cA.jpeg","isPro":false,"fullname":"Alexander Micklewright","user":"alexmick","type":"user"},{"_id":"66e16a677c2eb2da5109fb5c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/66e16a677c2eb2da5109fb5c/jPkg3g4mtCYbm8ci-ymtX.png","isPro":false,"fullname":"Antoine EDY","user":"antoineedy","type":"user"},{"_id":"67a8dc9e560939c755f39fb5","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/67a8dc9e560939c755f39fb5/uyfHhPi4fkGPIurIr7SP4.jpeg","isPro":false,"fullname":"Victor Xing","user":"vxing","type":"user"},{"_id":"67ab5868b028527841ce09f7","avatarUrl":"/avatars/5e8b7a341a807bd3e08c2cb98ca20dd3.svg","isPro":false,"fullname":"Iker TARDIO","user":"tardioik","type":"user"},{"_id":"67b708f9ae6ee066e29045c3","avatarUrl":"/avatars/824a8973e1013a615538df58ef9dfa17.svg","isPro":false,"fullname":"Marine Neyret","user":"mneyret","type":"user"},{"_id":"64a2e1ab812528832070a4b4","avatarUrl":"/avatars/c2fef6d21cfd4b66085127e7f3223025.svg","isPro":false,"fullname":"Victor 
Alibert","user":"victoralibert","type":"user"},{"_id":"63b6cb6b029518c6bbfda056","avatarUrl":"/avatars/28849f397a60eb33c359cf536f109485.svg","isPro":false,"fullname":"Tom Brendlé","user":"tombrendle","type":"user"},{"_id":"62b5db0a73ab76290041245a","avatarUrl":"/avatars/6c94a42e8f647578e7d31fcdfcbd591e.svg","isPro":false,"fullname":"Paul des Garets","user":"pdesgarets-illuin","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0,"organization":{"_id":"68dc126476aff34f469efbc4","name":"ModernVBERT","fullname":"ModernVBERT","avatar":"https://cdn-uploads.huggingface.co/production/uploads/6651baf4b34bbdaec88333e7/qxlWj1d9iagGW6T8yCkJV.png"}}">
AI-generated summary
ModernVBERT, a compact vision-language encoder, outperforms larger models in document retrieval by optimizing attention masking, image resolution, modality alignment, and contrastive objectives.
Multimodal embedding models are gaining prevalence, notably for document retrieval as efficient alternatives to text-only pipelines. These models are typically built by finetuning large vision-language decoders (VLMs) with contrastive losses on text-image pairs. In this work, we show that, while cost-efficient, this repurposing approach often bottlenecks retrieval performance. Through controlled experiments, we establish a principled recipe for improving visual document retrieval models. In particular, we measure the impact of attention masking, image resolution, modality alignment data regimes, and late-interaction-centered contrastive objectives, which emerge as central performance factors. Building on these insights, we release ModernVBERT, a compact 250M-parameter vision-language encoder that outperforms models up to 10 times larger when finetuned on document retrieval tasks. Models and code are made available at https://huggingface.co/ModernVBERT.
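To make the repurposing recipe concrete, below is a minimal sketch of the symmetric in-batch contrastive (InfoNCE-style) objective typically used to finetune a VLM into an embedding model on text-image pairs. The function name, shapes, and temperature value are illustrative assumptions, not the paper's exact training code.

```python
# Minimal sketch of a symmetric in-batch contrastive loss on text-image
# pairs, assuming row i of each batch is a matched pair. Illustrative only.
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb: torch.Tensor,
                     image_emb: torch.Tensor,
                     temperature: float = 0.02) -> torch.Tensor:
    """text_emb, image_emb: (batch, dim) pooled embeddings."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.T / temperature  # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # In-batch negatives: every off-diagonal pair counts as a negative.
    loss_t2i = F.cross_entropy(logits, targets)
    loss_i2t = F.cross_entropy(logits.T, targets)
    return (loss_t2i + loss_i2t) / 2
```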
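The late-interaction objectives highlighted in the abstract score a query against a document page by matching every query token to its best document patch (ColBERT-style MaxSim), rather than comparing two pooled vectors. The sketch below illustrates that scoring operator under these assumptions; it is not the released implementation.

```python
# Minimal sketch of ColBERT-style late-interaction (MaxSim) scoring
# between one query and one document page. Shapes are assumptions.
import torch
import torch.nn.functional as F

def maxsim_score(query_emb: torch.Tensor,  # (n_query_tokens, dim)
                 doc_emb: torch.Tensor     # (n_doc_patches, dim)
                 ) -> torch.Tensor:
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    sim = q @ d.T  # token-level cosine similarities, (n_q, n_d)
    # MaxSim: each query token keeps its best-matching document patch,
    # and the per-token maxima are summed into one relevance score.
    return sim.max(dim=-1).values.sum()
```

During contrastive training, this score can stand in for the single dot product in the pooled loss above, keeping in-batch negatives while matching at the token level.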