Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision
Published on Jul 8, 2024 · Orr Zohar, Xiaohan Wang, Yonatan Bitton, Idan Szpektor, Serena Yeung-Levy
Abstract
Video Self-Training with augmented Reasoning (Video-STaR) enhances Large Vision Language Models by leveraging labeled video datasets for video instruction tuning, improving general video understanding and downstream task performance.
The performance of Large Vision Language Models (LVLMs) depends on the size and quality of their training datasets. Existing video instruction tuning datasets lack diversity, as they are derived by prompting large language models with video captions to generate question-answer pairs, and are therefore mostly descriptive. Meanwhile, many labeled video datasets with diverse labels and supervision exist; however, we find that their integration into LVLMs is non-trivial. Herein, we present Video Self-Training with augmented Reasoning (Video-STaR), the first video self-training approach. Video-STaR allows the utilization of any labeled video dataset for video instruction tuning. In Video-STaR, an LVLM cycles between instruction generation and finetuning, which we show (I) improves general video understanding and (II) adapts LVLMs to novel downstream tasks with existing supervision. During generation, an LVLM is prompted to propose an answer. The answers are then filtered to retain only those that contain the original video labels, and the LVLM is re-trained on the generated dataset. By training only on generated answers that contain the correct video labels, Video-STaR utilizes existing video labels as weak supervision for video instruction tuning. Our results demonstrate that Video-STaR-enhanced LVLMs exhibit improved performance in (I) general video QA, where TempCompass performance improved by 10%, and (II) downstream tasks, where Video-STaR improved Kinetics700-QA accuracy by 20% and action quality assessment on FineDiving by 15%.
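For readers who prefer pseudocode, below is a minimal sketch of the cycle described in the abstract (generate → filter by label verification → finetune). The callables `generate_answer` and `finetune`, the exact matching rule in `contains_labels`, and the data layout are all illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of the Video-STaR self-training cycle described in the abstract.
# `generate_answer` and `finetune` are hypothetical callables supplied by the user
# (e.g. wrappers around an LVLM); they are not the authors' API.
from typing import Callable, Iterable, List, Tuple


def contains_labels(answer: str, labels: Iterable[str]) -> bool:
    """Weak verification: keep an answer only if it mentions every ground-truth label."""
    return all(label.lower() in answer.lower() for label in labels)


def video_star_cycle(
    generate_answer: Callable[[str, str], str],   # (video_path, question) -> answer
    finetune: Callable[[list], Callable[[str, str], str]],  # returns an updated generator
    labeled_videos: List[Tuple[str, str, List[str]]],        # (video_path, question, labels)
    num_cycles: int = 3,
):
    for _ in range(num_cycles):
        kept = []
        # 1. Generation: the current model proposes an answer for each labeled video.
        for video, question, labels in labeled_videos:
            answer = generate_answer(video, question)
            # 2. Label filtering: only answers containing the original labels survive,
            #    so existing labels act as weak supervision for instruction tuning.
            if contains_labels(answer, labels):
                kept.append({"video": video, "question": question, "answer": answer})
        # 3. Finetuning: re-train on the filtered, self-generated instruction data.
        generate_answer = finetune(kept)
    return generate_answer
```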
Community
VSTaR 1M Gradio dataset viewer: https://huggingface.co/spaces/orrzohar/Video-STaR
To download VSTaR 1M: https://huggingface.co/datasets/orrzohar/Video-STaR
Code: https://github.com/orrzohar/Video-STaR
Project Page: https://orrzohar.github.io/projects/video-star/
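To get started with the download, here is a minimal sketch using the Hugging Face Hub client; treat it as an assumption-laden quick start rather than the official loader, since the exact file layout, configurations, and field names are documented on the dataset card.

```python
# Hypothetical quick-start for fetching VSTaR 1M; check the dataset card for
# the exact file layout and annotation format before relying on this.
from huggingface_hub import snapshot_download

# Download the dataset repository (annotations, metadata) to a local cache directory.
local_dir = snapshot_download(repo_id="orrzohar/Video-STaR", repo_type="dataset")
print(local_dir)

# If the repo exposes a loadable configuration, the `datasets` library may also work:
# from datasets import load_dataset
# ds = load_dataset("orrzohar/Video-STaR")
```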
Congrats on this work! Thanks so much for publishing artifacts on the hub + linking them to the paper :)
Smart!
Models citing this paper: 0
No models link this paper.