Enhancing Vision-Language Pre-training with Rich Supervisions
Paper: Enhancing Vision-Language Pre-training with Rich Supervisions (arXiv:2403.03346). Authors: Yuan Gao, Kunyu Shi, Pengkai Zhu, Edouard Belval, Oren Nuriel, Srikar Appalaraju, Shabnam Ghadar, Vijay Mahadevan, Zhuowen Tu, Stefano Soatto. Published: 2024-03-05.
AI-generated summary
Strongly Supervised pre-training with ScreenShots (S4) enhances Vision-Language Models by using web screenshots with tree-structured HTML elements and spatial localization, resulting in significant improvements in diverse downstream tasks.
We propose Strongly Supervised pre-training with ScreenShots (S4), a novel pre-training paradigm for Vision-Language Models that uses data from large-scale web screenshot rendering. Web screenshots unlock a wealth of visual and textual cues that are absent from plain image-text pairs. In S4, we leverage the inherent tree-structured hierarchy of HTML elements and their spatial localization to carefully design 10 pre-training tasks with large-scale annotated data. These tasks resemble downstream tasks across different domains, and the annotations are cheap to obtain. We demonstrate that, compared to current screenshot pre-training objectives, our pre-training method significantly enhances the performance of image-to-text models on nine varied and popular downstream tasks, with improvements of up to 76.1% on Table Detection and at least 1% on Widget Captioning.
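The key idea above is that a rendered web page pairs every HTML element's text with the screen region the browser assigned it, so supervision labels come for free. The paper's actual pipeline is not reproduced here; the following is a minimal, hypothetical sketch in which a DOM is modeled as nested dicts with a renderer-assigned bounding box, and a tree walk yields (bbox, text) pairs resembling localization- or captioning-style pre-training targets:

```python
# Hypothetical sketch: deriving cheap supervision pairs from a rendered
# web page's DOM. Each node carries its tag, its text (if any), and the
# bounding box the renderer assigned it. Walking the tree yields
# (depth, tag, bbox, text) tuples that resemble S4-style localization
# and captioning labels. The DOM structure below is a toy example, not
# the paper's data format.

def iter_supervision_pairs(node, depth=0):
    """Yield (depth, tag, bbox, text) for every DOM node carrying text."""
    text = node.get("text", "").strip()
    if text:
        yield depth, node["tag"], node["bbox"], text
    for child in node.get("children", []):
        yield from iter_supervision_pairs(child, depth + 1)

# Toy DOM for one rendered screenshot (bbox = x, y, width, height in px).
dom = {
    "tag": "body", "bbox": (0, 0, 800, 600),
    "children": [
        {"tag": "h1", "bbox": (20, 10, 400, 40), "text": "Daily Deals"},
        {"tag": "table", "bbox": (20, 80, 600, 200),
         "children": [
             {"tag": "td", "bbox": (25, 85, 120, 30), "text": "Widget"},
             {"tag": "td", "bbox": (150, 85, 80, 30), "text": "$9.99"},
         ]},
    ],
}

pairs = list(iter_supervision_pairs(dom))
for depth, tag, bbox, text in pairs:
    print(f"{'  ' * depth}<{tag}> at {bbox}: {text!r}")
```

Because the tree structure is preserved, the same walk can also back tasks like table detection (the `table` node's bbox is a detection target) without any manual annotation.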