Enhancing Vision-Language Pre-training with Rich Supervisions
Paper: Enhancing Vision-Language Pre-training with Rich Supervisions (arXiv:2403.03346). Authors: Yuan Gao, Kunyu Shi, Pengkai Zhu, Edouard Belval, Oren Nuriel, Srikar Appalaraju, Shabnam Ghadar, Vijay Mahadevan, Zhuowen Tu, Stefano Soatto. Published: 2024-03-05.
AI-generated summary
Strongly Supervised pre-training with ScreenShots (S4) enhances Vision-Language Models by using web screenshots with tree-structured HTML elements and spatial localization, resulting in significant improvements in diverse downstream tasks.
We propose Strongly Supervised pre-training with ScreenShots (S4), a novel pre-training paradigm for Vision-Language Models that uses data from large-scale web screenshot rendering. Web screenshots unlock a wealth of visual and textual cues that are absent from plain image-text pairs. In S4, we leverage the inherent tree-structured hierarchy of HTML elements and their spatial localization to carefully design 10 pre-training tasks with large-scale annotated data. These tasks resemble downstream tasks across different domains, and the annotations are cheap to obtain. We demonstrate that, compared to current screenshot pre-training objectives, our pre-training method significantly enhances the performance of image-to-text models on nine varied and popular downstream tasks, with improvements of up to 76.1% on Table Detection and at least 1% on Widget Captioning.
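The key idea above is that a rendered web page pairs every HTML element's text with the screen region the browser assigned it, so supervision labels come for free. The paper's actual pipeline is not reproduced here; the following is a minimal, hypothetical sketch in which a DOM is modeled as nested dicts with a renderer-assigned bounding box, and a tree walk yields (bbox, text) pairs resembling localization- or captioning-style pre-training targets:

```python
# Hypothetical sketch: deriving cheap supervision pairs from a rendered
# web page's DOM. Each node carries its tag, its text (if any), and the
# bounding box the renderer assigned it. Walking the tree yields
# (depth, tag, bbox, text) tuples that resemble S4-style localization
# and captioning labels. The DOM structure below is a toy example, not
# the paper's data format.

def iter_supervision_pairs(node, depth=0):
    """Yield (depth, tag, bbox, text) for every DOM node carrying text."""
    text = node.get("text", "").strip()
    if text:
        yield depth, node["tag"], node["bbox"], text
    for child in node.get("children", []):
        yield from iter_supervision_pairs(child, depth + 1)

# Toy DOM for one rendered screenshot (bbox = x, y, width, height in px).
dom = {
    "tag": "body", "bbox": (0, 0, 800, 600),
    "children": [
        {"tag": "h1", "bbox": (20, 10, 400, 40), "text": "Daily Deals"},
        {"tag": "table", "bbox": (20, 80, 600, 200),
         "children": [
             {"tag": "td", "bbox": (25, 85, 120, 30), "text": "Widget"},
             {"tag": "td", "bbox": (150, 85, 80, 30), "text": "$9.99"},
         ]},
    ],
}

pairs = list(iter_supervision_pairs(dom))
for depth, tag, bbox, text in pairs:
    print(f"{'  ' * depth}<{tag}> at {bbox}: {text!r}")
```

Because the tree structure is preserved, the same walk can also back tasks like table detection (the `table` node's bbox is a detection target) without any manual annotation.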