arxiv:2403.03346

Enhancing Vision-Language Pre-training with Rich Supervisions

Published on Mar 5, 2024 · Submitted by AK on Mar 7, 2024
Authors: Yuan Gao, Kunyu Shi, Pengkai Zhu, Edouard Belval, Oren Nuriel, Srikar Appalaraju, Shabnam Ghadar, Vijay Mahadevan, Zhuowen Tu, Stefano Soatto

Abstract

We propose Strongly Supervised pre-training with ScreenShots (S4), a novel pre-training paradigm for Vision-Language Models that uses data from large-scale web screenshot rendering. Web screenshots unlock a treasure trove of visual and textual cues that are absent from plain image-text pairs. In S4, we leverage the inherent tree-structured hierarchy of HTML elements and their spatial localization to carefully design 10 pre-training tasks with large-scale annotated data. These tasks resemble downstream tasks across different domains, and the annotations are cheap to obtain. We demonstrate that, compared to current screenshot pre-training objectives, our pre-training method significantly enhances the performance of image-to-text models on nine varied and popular downstream tasks - up to a 76.1% improvement on Table Detection, and at least 1% on Widget Captioning.

AI-generated summary

Strongly Supervised pre-training with ScreenShots (S4) enhances Vision-Language Models by using web screenshots with tree-structured HTML elements and spatial localization, resulting in significant improvements on diverse downstream tasks.
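For a concrete picture of the kind of supervision web rendering yields, the sketch below pairs a rendered screenshot with per-element text and pixel-space bounding boxes. It is a minimal illustration assuming Playwright for rendering; the selector list, output format, and `harvest` helper are hypothetical and not the authors' actual S4 pipeline, which derives 10 task-specific annotations from the full HTML tree.

```python
# A minimal sketch of harvesting screenshot/DOM supervision pairs of the kind
# S4 builds on. Playwright, the CSS selectors, and the output schema here are
# illustrative assumptions, not the paper's pipeline.
from playwright.sync_api import sync_playwright

def harvest(url: str, out_png: str = "page.png"):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 1280})
        page.goto(url, wait_until="networkidle")
        page.screenshot(path=out_png)  # the visual input (the "screenshot")

        # Pair each visible element's text with its rendered bounding box;
        # together these provide the spatial + textual cues the abstract cites.
        annotations = []
        for el in page.query_selector_all("a, p, h1, h2, h3, td, button"):
            box = el.bounding_box()  # None for elements that are not rendered
            text = el.inner_text().strip()
            if box and text:
                annotations.append({
                    "text": text,
                    "bbox": box,  # {"x", "y", "width", "height"} in pixels
                    "tag": el.evaluate("e => e.tagName"),
                })
        browser.close()
        return annotations
```

Because the annotations come for free from the renderer rather than from human labelers, supervision of this kind is cheap to scale, which is the property the abstract emphasizes.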

Community

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API.

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space.

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2403.03346 in a model README.md to link it from this page.
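For example, including the arXiv link anywhere in a model card is enough for Hugging Face to index it; a hypothetical minimal README.md (the model name and tags below are placeholders) might look like:

```markdown
---
tags:
- vision-language
---

# my-s4-pretrained-model

Pre-trained following [Enhancing Vision-Language Pre-training with Rich
Supervisions](https://arxiv.org/abs/2403.03346).
```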

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2403.03346 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2403.03346 in a Space README.md to link it from this page.

Collections including this paper 4