Paper page - Adapting Vision-Language Models for E-commerce Understanding at Scale
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:

* DAVE: A VLM Vision Encoder for Document Understanding and Web Agents (https://huggingface.co/papers/2512.17221) (2025)
* Vision-aligned Latent Reasoning for Multi-modal Large Language Model (https://huggingface.co/papers/2602.04476) (2026)
* Ostrakon-VL: Towards Domain-Expert MLLM for Food-Service and Retail Stores (https://huggingface.co/papers/2601.21342) (2026)
* Parameter Efficient Multimodal Instruction Tuning for Romanian Vision Language Models (https://huggingface.co/papers/2512.14926) (2025)
* E-VAds: An E-commerce Short Videos Understanding Benchmark for MLLMs (https://huggingface.co/papers/2602.08355) (2026)
* Benchmarking Multimodal Large Language Models for Missing Modality Completion in Product Catalogues (https://huggingface.co/papers/2601.19750) (2026)
* RexBERT: Context Specialized Bidirectional Encoders for E-commerce (https://huggingface.co/papers/2602.04605) (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`
\n","updatedAt":"2026-02-14T01:39:34.649Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6979897618293762},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2602.11733","authors":[{"_id":"698eef93cace060ff123af77","user":{"_id":"661d4e74b8f13412f6d48a50","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/661d4e74b8f13412f6d48a50/xqnxU1VhOipNnsNlbU6ji.png","isPro":false,"fullname":"Matteo Nulli","user":"MatteoNulli","type":"user"},"name":"Matteo Nulli","status":"claimed_verified","statusLastChangedAt":"2026-02-17T15:51:19.518Z","hidden":false},{"_id":"698eef93cace060ff123af78","name":"Vladimir Orshulevich","hidden":false},{"_id":"698eef93cace060ff123af79","name":"Tala Bazazo","hidden":false},{"_id":"698eef93cace060ff123af7a","name":"Christian Herold","hidden":false},{"_id":"698eef93cace060ff123af7b","name":"Michael Kozielski","hidden":false},{"_id":"698eef93cace060ff123af7c","name":"Marcin Mazur","hidden":false},{"_id":"698eef93cace060ff123af7d","name":"Szymon Tuzel","hidden":false},{"_id":"698eef93cace060ff123af7e","name":"Cees G. M. Snoek","hidden":false},{"_id":"698eef93cace060ff123af7f","name":"Seyyed Hadi Hashemi","hidden":false},{"_id":"698eef93cace060ff123af80","name":"Omar Javed","hidden":false},{"_id":"698eef93cace060ff123af81","name":"Yannick Versley","hidden":false},{"_id":"698eef93cace060ff123af82","name":"Shahram Khadivi","hidden":false}],"publishedAt":"2026-02-12T08:59:22.000Z","submittedOnDailyAt":"2026-02-13T07:11:25.391Z","title":"Adapting Vision-Language Models for E-commerce Understanding at Scale","submittedOnDailyBy":{"_id":"661d4e74b8f13412f6d48a50","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/661d4e74b8f13412f6d48a50/xqnxU1VhOipNnsNlbU6ji.png","isPro":false,"fullname":"Matteo Nulli","user":"MatteoNulli","type":"user"},"summary":"E-commerce product understanding demands by nature, strong multimodal comprehension from text, images, and structured attributes. General-purpose Vision-Language Models (VLMs) enable generalizable multimodal latent modelling, yet there is no documented, well-known strategy for adapting them to the attribute-centric, multi-image, and noisy nature of e-commerce data, without sacrificing general performance. In this work, we show through a large-scale experimental study, how targeted adaptation of general VLMs can substantially improve e-commerce performance while preserving broad multimodal capabilities. 
Furthermore, we propose a novel extensive evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction.","upvotes":12,"discussionId":"698eef93cace060ff123af83","ai_summary":"General-purpose Vision-Language Models can be effectively adapted for e-commerce applications through targeted techniques that enhance product understanding while maintaining broad multimodal capabilities.","ai_keywords":["Vision-Language Models","multimodal comprehension","e-commerce data","attribute-centric","multi-image","noisy data","generalizable multimodal latent modelling","targeted adaptation","deep product understanding","strict instruction following","dynamic attribute extraction"],"organization":{"_id":"632dd24dfdb35759ea67fc31","name":"eBay","fullname":"eBay","avatar":"https://cdn-uploads.huggingface.co/production/uploads/1663947365758-632dd20ffdb35759ea67f9a1.jpeg"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"661d4e74b8f13412f6d48a50","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/661d4e74b8f13412f6d48a50/xqnxU1VhOipNnsNlbU6ji.png","isPro":false,"fullname":"Matteo Nulli","user":"MatteoNulli","type":"user"},{"_id":"68279a15ba94833770740288","avatarUrl":"/avatars/4262920b815e0e2430f8a20cf93503e3.svg","isPro":false,"fullname":"Christian Herold","user":"cherold3141","type":"user"},{"_id":"65291d3a90f06dd8a8ef42dc","avatarUrl":"/avatars/0a36e313a71729e1b257b0b157b1d837.svg","isPro":false,"fullname":"Michael Kozielski","user":"mickoz84","type":"user"},{"_id":"67b4b7262b874b93dab35210","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/ZTHfaGdEcLFr9nrkvJuba.png","isPro":false,"fullname":"Yannick Versley (eBay)","user":"yversley-ebay","type":"user"},{"_id":"64aeb19959d35c5f81802e10","avatarUrl":"/avatars/c121ed8034a3b28b77ed5198820112d9.svg","isPro":false,"fullname":"K M","user":"km-b","type":"user"},{"_id":"63a852ec353e10031a8be92e","avatarUrl":"/avatars/42ff79115c7b84fa00e7c5df373dc77d.svg","isPro":false,"fullname":"Stefan Petkov Vasilev","user":"stefanvasilev","type":"user"},{"_id":"6270324ebecab9e2dcf245de","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6270324ebecab9e2dcf245de/cMbtWSasyNlYc9hvsEEzt.jpeg","isPro":false,"fullname":"Kye Gomez","user":"kye","type":"user"},{"_id":"68a724821e1fa867e36666e5","avatarUrl":"/avatars/aa85d3b06672b50871a8f8ebcf11a443.svg","isPro":false,"fullname":"TrendWatching","user":"TrendWatching","type":"user"},{"_id":"65d53f8f36155e69ea572e2a","avatarUrl":"/avatars/bf59ed40774659443acacf0311787f89.svg","isPro":false,"fullname":"Omar Javed","user":"ojaved","type":"user"},{"_id":"6401cf6df98fbc64bcd82098","avatarUrl":"/avatars/aecac11720ac66fcedeb179d58f0312a.svg","isPro":false,"fullname":"Razvan Matisan","user":"razvan-matisan","type":"user"},{"_id":"63b3f7019b0c6fa1de80118b","avatarUrl":"/avatars/26b5c7208430be76a1b3fc0010c58c1b.svg","isPro":false,"fullname":"Gilad Fuchs","user":"GiladFuchs","type":"user"},{"_id":"662414cf2419feed62b25a4a","avatarUrl":"/avatars/01c33b5818e38f54a591e532e895b148.svg","isPro":false,"fullname":"Cimrman","user":"andelka","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0,"organization":{"_id":"632dd24dfdb35759ea67fc31","name":"eBay","fullname":"eBay","avatar":"https://cdn-uploads.huggingface.co/production/uploads/1663947365758-632dd20ffdb35759ea67f9a1.jpeg"}}">
AI summary: General-purpose Vision-Language Models can be effectively adapted for e-commerce applications through targeted techniques that enhance product understanding while maintaining broad multimodal capabilities.
Figure 1: Output of our E-commerce Adapted VLMs compared against same-size LLaVA-OneVision. We show our models' ability to more faithfully extract attributes from e-commerce items. In red, we highlight wrong model predictions that are neither tied to the image nor valid item attributes.
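To reproduce the kind of baseline query shown in Figure 1, a minimal sketch using a public llava-hf LLaVA-OneVision checkpoint with standard Hugging Face transformers usage might look like the following. The checkpoint id, image path, prompt, and attribute list are illustrative assumptions; the adapted eBay models themselves are not shown here.

```python
# Minimal sketch (assumption: public llava-hf checkpoint, standard transformers API).
# This is NOT the adapted model from the paper, only an example baseline query.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-7b-ov-hf"  # illustrative checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("listing_image.jpg")  # hypothetical e-commerce listing photo
conversation = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text",
         "text": "Extract the item attributes (brand, color, material, size) "
                 "visible in this listing image. Answer as a JSON object."},
    ],
}]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```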
Figure 2: Visual Verification Pipeline. The figure shows the pipeline we use to create the 4M e-commerce visual instruction-tuning examples. We begin by collecting raw listings data from the web (left). We then clean and pre-process the textual entries. In parallel, we create detailed captions for the corresponding images with InternVL-2.5-26B. Finally, we provide the captions together with the cleaned listings to Mistral-Small-3-24B to obtain the verified instructions, which are used, along with the original images, to train our models 🔥.
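As a rough illustration of how such a pipeline could be wired together, here is a minimal Python sketch. The helper names, data fields, and prompt are assumptions made for illustration; the two model calls (InternVL-2.5-26B for captioning, Mistral-Small-3-24B for verification) are left as placeholders rather than invented API calls.

```python
# A minimal, hypothetical sketch of the visual-verification pipeline in Figure 2.
# Helper names, prompts, and data fields are illustrative assumptions, not the
# paper's released implementation.
import re
from dataclasses import dataclass

@dataclass
class Listing:
    image_path: str
    title: str
    attributes: dict[str, str]  # e.g. {"Brand": "Nike", "Colour": "Black"}

def clean_listing(listing: Listing) -> Listing:
    """Pre-process the textual entries: strip HTML remnants, collapse whitespace."""
    def norm(text: str) -> str:
        text = re.sub(r"<[^>]+>", " ", text)
        return re.sub(r"\s+", " ", text).strip()
    return Listing(
        image_path=listing.image_path,
        title=norm(listing.title),
        attributes={norm(k): norm(v) for k, v in listing.attributes.items()},
    )

def caption_image(image_path: str) -> str:
    """Placeholder for a detailed image caption from a captioning VLM
    (InternVL-2.5-26B in the paper)."""
    raise NotImplementedError("call your captioning model here")

def verify_with_llm(prompt: str) -> str:
    """Placeholder for the verification LLM call (Mistral-Small-3-24B in the paper)."""
    raise NotImplementedError("call your verification LLM here")

def build_verified_instruction(raw: Listing) -> dict:
    listing = clean_listing(raw)
    caption = caption_image(listing.image_path)
    # Ask the LLM to keep only listing attributes supported by the caption and
    # to phrase the result as an instruction/response pair.
    prompt = (
        "You are given a product image caption and a cleaned listing.\n"
        f"Caption: {caption}\n"
        f"Title: {listing.title}\n"
        f"Attributes: {listing.attributes}\n"
        "Return an instruction/response pair that only uses attributes "
        "supported by the caption."
    )
    verified = verify_with_llm(prompt)
    # The verified instruction is paired with the original image for training.
    return {"image": listing.image_path, "instruction": verified}
```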
Figure 3: eBay Single-Image Visual Instruction Tuning Set. We break down the components of our internal single-image instruction-tuning set. The pie chart on the left shows the percentage of each task in our set. On the right, we report each task with its sub-tasks and the total number of instructions in parentheses.