Demo-ICL: In-Context Learning for Procedural Video Knowledge Acquisition
\n","updatedAt":"2026-02-11T01:43:11.867Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7224732041358948},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2602.08439","authors":[{"_id":"698ab3e51b2dc6b37d61afaa","user":{"_id":"652965773a416e1f2173443b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/652965773a416e1f2173443b/y9MB8YgHzbwCXAc4EI9T3.jpeg","isPro":true,"fullname":"Yuhao Dong","user":"THUdyh","type":"user"},"name":"Yuhao Dong","status":"claimed_verified","statusLastChangedAt":"2026-02-10T09:06:16.048Z","hidden":false},{"_id":"698ab3e51b2dc6b37d61afab","name":"Shulin Tian","hidden":false},{"_id":"698ab3e51b2dc6b37d61afac","name":"Shuai Liu","hidden":false},{"_id":"698ab3e51b2dc6b37d61afad","name":"Shuangrui Ding","hidden":false},{"_id":"698ab3e51b2dc6b37d61afae","name":"Yuhang Zang","hidden":false},{"_id":"698ab3e51b2dc6b37d61afaf","name":"Xiaoyi Dong","hidden":false},{"_id":"698ab3e51b2dc6b37d61afb0","name":"Yuhang Cao","hidden":false},{"_id":"698ab3e51b2dc6b37d61afb1","user":{"_id":"64b4eec4faa3181a5eab9c46","avatarUrl":"/avatars/bcc9bf5cbf67546ad2b4c9ec8b96ac96.svg","isPro":true,"fullname":"Jiaqi Wang","user":"myownskyW7","type":"user"},"name":"Jiaqi Wang","status":"claimed_verified","statusLastChangedAt":"2026-02-17T15:50:28.890Z","hidden":false},{"_id":"698ab3e51b2dc6b37d61afb2","name":"Ziwei Liu","hidden":false}],"publishedAt":"2026-02-09T09:51:29.000Z","submittedOnDailyAt":"2026-02-10T02:03:30.996Z","title":"Demo-ICL: In-Context Learning for Procedural Video Knowledge Acquisition","submittedOnDailyBy":{"_id":"652965773a416e1f2173443b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/652965773a416e1f2173443b/y9MB8YgHzbwCXAc4EI9T3.jpeg","isPro":true,"fullname":"Yuhao Dong","user":"THUdyh","type":"user"},"summary":"Despite the growing video understanding capabilities of recent Multimodal Large Language Models (MLLMs), existing video benchmarks primarily assess understanding based on models' static, internal knowledge, rather than their ability to learn and adapt from dynamic, novel contexts from few examples. To bridge this gap, we present Demo-driven Video In-Context Learning, a novel task focused on learning from in-context demonstrations to answer questions about the target videos. Alongside this, we propose Demo-ICL-Bench, a challenging benchmark designed to evaluate demo-driven video in-context learning capabilities. Demo-ICL-Bench is constructed from 1200 instructional YouTube videos with associated questions, from which two types of demonstrations are derived: (i) summarizing video subtitles for text demonstration; and (ii) corresponding instructional videos as video demonstrations. To effectively tackle this new challenge, we develop Demo-ICL, an MLLM with a two-stage training strategy: video-supervised fine-tuning and information-assisted direct preference optimization, jointly enhancing the model's ability to learn from in-context examples. 
Extensive experiments with state-of-the-art MLLMs confirm the difficulty of Demo-ICL-Bench, demonstrate the effectiveness of Demo-ICL, and thereby unveil future research directions.","upvotes":28,"discussionId":"698ab3e51b2dc6b37d61afb3","githubRepo":"https://github.com/dongyh20/Demo-ICL","githubRepoAddedBy":"user","ai_summary":"Researchers introduce a new video understanding task and benchmark that evaluates models' ability to learn from few-shot demonstrations, along with a specialized MLLM architecture trained using a two-stage approach combining video supervision and preference optimization.","ai_keywords":["Multimodal Large Language Models","video understanding","in-context learning","video benchmarks","Demo-ICL-Bench","video-supervised fine-tuning","direct preference optimization"],"githubStars":32,"organization":{"_id":"62d55f243bf5e059f7ca25ba","name":"mmlab-ntu","fullname":"MMLab@NTU","avatar":"https://cdn-uploads.huggingface.co/production/uploads/1658151991971-62b5777f593a2c49da69dc02.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"652965773a416e1f2173443b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/652965773a416e1f2173443b/y9MB8YgHzbwCXAc4EI9T3.jpeg","isPro":true,"fullname":"Yuhao Dong","user":"THUdyh","type":"user"},{"_id":"62ab1ac1d48b4d8b048a3473","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1656826685333-62ab1ac1d48b4d8b048a3473.png","isPro":false,"fullname":"Ziwei Liu","user":"liuziwei7","type":"user"},{"_id":"64f7f5b54101c731ca84ae05","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64f7f5b54101c731ca84ae05/13DwdxOo3tWbxKDLd44B9.jpeg","isPro":false,"fullname":"Shuai Liu","user":"Choiszt","type":"user"},{"_id":"66c7e8d021fc3eab824eab70","avatarUrl":"/avatars/38f1a00850e9ebac5007b8056c0a071f.svg","isPro":false,"fullname":"YANG Zhe","user":"Savannah-yz","type":"user"},{"_id":"6658d01c6f1a71ba56d6c273","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/tc4nZrMuZQLfgt5aVxtH4.jpeg","isPro":false,"fullname":"Tian Shulin","user":"shulin16","type":"user"},{"_id":"6944fd60b2d6dd4b548b2a6f","avatarUrl":"/avatars/b78aa04118f9978b38d566c234abf416.svg","isPro":false,"fullname":"Li Sihan","user":"aflicuried","type":"user"},{"_id":"649369b34f0e40ee1a0ed5ba","avatarUrl":"/avatars/50d0e77883579d5002906c8d29c26ec5.svg","isPro":false,"fullname":"Maxwell Yao","user":"MaxwellJryao","type":"user"},{"_id":"67c28324cfdcb62c548ce862","avatarUrl":"/avatars/4d841fa9cc3812088777a7c9f4d70cbf.svg","isPro":false,"fullname":"Jiahe Zhang","user":"jiahezh","type":"user"},{"_id":"66aa94cbd59743aa4a65646f","avatarUrl":"/avatars/6a59a1a9ac2b509cb93d679111f29d10.svg","isPro":false,"fullname":"Yao Runmao","user":"yaorunmao","type":"user"},{"_id":"698aba445866ffbd331089b9","avatarUrl":"/avatars/24cd2d4f114102a76eac7cddd17b9fcd.svg","isPro":false,"fullname":"Liu 
Hao","user":"LiuHaoTHU","type":"user"},{"_id":"698abb0e65003a8a14b23142","avatarUrl":"/avatars/4db0a6b05e5c2c359e96189caa67428e.svg","isPro":false,"fullname":"wu","user":"Asher213","type":"user"},{"_id":"6986c5a234b541182f813d40","avatarUrl":"/avatars/963d8a94cc7c41a85f82320b8ef4545d.svg","isPro":false,"fullname":"JohnLiu","user":"lolrn","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0,"organization":{"_id":"62d55f243bf5e059f7ca25ba","name":"mmlab-ntu","fullname":"MMLab@NTU","avatar":"https://cdn-uploads.huggingface.co/production/uploads/1658151991971-62b5777f593a2c49da69dc02.png"}}">
AI-generated summary
Researchers introduce a new video understanding task and benchmark that evaluates models' ability to learn from few-shot demonstrations, along with a specialized MLLM architecture trained using a two-stage approach combining video supervision and preference optimization.
Despite the growing video understanding capabilities of recent Multimodal Large Language Models (MLLMs), existing video benchmarks primarily assess understanding based on models' static, internal knowledge rather than their ability to learn and adapt to dynamic, novel contexts from only a few examples. To bridge this gap, we present Demo-driven Video In-Context Learning, a novel task focused on learning from in-context demonstrations to answer questions about target videos. Alongside this, we propose Demo-ICL-Bench, a challenging benchmark designed to evaluate demo-driven video in-context learning capabilities. Demo-ICL-Bench is constructed from 1200 instructional YouTube videos with associated questions, from which two types of demonstrations are derived: (i) summarized video subtitles as text demonstrations; and (ii) the corresponding instructional videos as video demonstrations. To tackle this new challenge, we develop Demo-ICL, an MLLM trained with a two-stage strategy: video-supervised fine-tuning and information-assisted direct preference optimization, which jointly enhance the model's ability to learn from in-context examples. Extensive experiments with state-of-the-art MLLMs confirm the difficulty of Demo-ICL-Bench, demonstrate the effectiveness of Demo-ICL, and thereby unveil future research directions.
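To make the task format concrete, below is a minimal sketch of how a single demo-driven in-context example could be represented and assembled into a query for an MLLM. The field names, multiple-choice format, and helper function are illustrative assumptions for exposition only; they are not the released Demo-ICL-Bench schema or the authors' prompt template.

```python
from dataclasses import dataclass
from typing import List, Literal

# Hypothetical representation of one benchmark item: a demonstration
# (a subtitle summary or an instructional video) paired with a target
# video and a question about it. Field names are assumptions.
@dataclass
class DemoICLItem:
    demo_type: Literal["text", "video"]  # (i) subtitle summary or (ii) demo video
    demo_content: str                    # summary text, or a path/URL to the demo video
    target_video: str                    # path/URL of the video the question is about
    question: str
    options: List[str]                   # candidate answers (assumed multiple-choice)
    answer: str                          # ground-truth choice, e.g. "B"

def build_icl_prompt(item: DemoICLItem) -> str:
    """Assemble a demonstration-conditioned query string (sketch only)."""
    if item.demo_type == "text":
        demo_block = f"Demonstration (subtitle summary):\n{item.demo_content}\n"
    else:
        demo_block = f"Demonstration video: <video:{item.demo_content}>\n"
    options = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(item.options))
    return (
        "Learn the procedure from the demonstration, then answer the question "
        "about the target video.\n\n"
        f"{demo_block}\n"
        f"Target video: <video:{item.target_video}>\n"
        f"Question: {item.question}\n{options}\nAnswer:"
    )

# Usage example with placeholder content.
item = DemoICLItem(
    demo_type="text",
    demo_content="Step 1: whisk the eggs. Step 2: heat the pan. Step 3: pour and fold.",
    target_video="videos/target_0001.mp4",
    question="Which step does the person in the target video skip?",
    options=["Whisking the eggs", "Heating the pan", "Folding the omelette"],
    answer="B",
)
print(build_icl_prompt(item))
```

In this reading of the task, the model must ground its answer in the demonstration supplied at inference time rather than in its internal knowledge, which is what distinguishes Demo-ICL-Bench from benchmarks that query a video in isolation.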