

\n","updatedAt":"2025-02-12T12:51:18.939Z","author":{"_id":"65e1b6e9501590df0173cbd3","avatarUrl":"/avatars/a73e2139700e23eff455734c99cef5ba.svg","fullname":"Jian Hu","name":"lwpyh","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.718449592590332},"editors":["lwpyh"],"editorAvatarUrls":["/avatars/a73e2139700e23eff455734c99cef5ba.svg"],"reactions":[],"isReport":false}},{"id":"67ad4c3432ea9bda4152e6a3","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2025-02-13T01:34:44.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling](https://huggingface.co/papers/2501.12386) (2025)\n* [VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling](https://huggingface.co/papers/2501.00574) (2024)\n* [ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding](https://huggingface.co/papers/2412.20504) (2024)\n* [The Devil is in Temporal Token: High Quality Video Reasoning Segmentation](https://huggingface.co/papers/2501.08549) (2025)\n* [Temporal Preference Optimization for Long-Form Video Understanding](https://huggingface.co/papers/2501.13919) (2025)\n* [VidCtx: Context-aware Video Question Answering with Image Models](https://huggingface.co/papers/2412.17415) (2024)\n* [VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos](https://huggingface.co/papers/2502.01549) (2025)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

\n

The following papers were recommended by the Semantic Scholar API

\n\n

Please give a thumbs up to this comment if you found it helpful!

\n

If you want recommendations for any Paper on Hugging Face checkout this Space

\n

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend

\n","updatedAt":"2025-02-13T01:34:44.491Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6999425292015076},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2502.06428","authors":[{"_id":"67ac99089e12456bdb1d2e9d","user":{"_id":"65e1b6e9501590df0173cbd3","avatarUrl":"/avatars/a73e2139700e23eff455734c99cef5ba.svg","isPro":false,"fullname":"Jian Hu","user":"lwpyh","type":"user"},"name":"Jian Hu","status":"claimed_verified","statusLastChangedAt":"2025-02-19T11:12:25.146Z","hidden":false},{"_id":"67ac99089e12456bdb1d2e9e","name":"Zixu Cheng","hidden":false},{"_id":"67ac99089e12456bdb1d2e9f","user":{"_id":"635f8ed47c05eb9f59963d3a","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/635f8ed47c05eb9f59963d3a/uQf4p9N9pSaFy87Wg9v4k.jpeg","isPro":false,"fullname":"ChenyangSi","user":"ChenyangSi","type":"user"},"name":"Chenyang Si","status":"claimed_verified","statusLastChangedAt":"2025-12-16T09:50:15.809Z","hidden":false},{"_id":"67ac99089e12456bdb1d2ea0","name":"Wei Li","hidden":false},{"_id":"67ac99089e12456bdb1d2ea1","name":"Shaogang Gong","hidden":false}],"publishedAt":"2025-02-10T13:03:05.000Z","submittedOnDailyAt":"2025-02-12T10:21:18.930Z","title":"CoS: Chain-of-Shot Prompting for Long Video Understanding","submittedOnDailyBy":{"_id":"65e1b6e9501590df0173cbd3","avatarUrl":"/avatars/a73e2139700e23eff455734c99cef5ba.svg","isPro":false,"fullname":"Jian Hu","user":"lwpyh","type":"user"},"summary":"Multi-modal Large Language Models (MLLMs) struggle with long videos due to\nthe need for excessive visual tokens. These tokens exceed massively the context\nlength of MLLMs, resulting in filled by redundant task-irrelevant shots. How to\nselect shots is an unsolved critical problem: sparse sampling risks missing key\ndetails, while exhaustive sampling overwhelms the model with irrelevant\ncontent, leading to video misunderstanding. To solve this problem, we propose\nChain-of-Shot prompting (CoS). The key idea is to frame shot selection as\ntest-time visual prompt optimisation, choosing shots adaptive to video\nunderstanding semantic task by optimising shots-task alignment. CoS has two key\nparts: (1) a binary video summary mechanism that performs pseudo temporal\ngrounding, discovering a binary coding to identify task-relevant shots, and (2)\na video co-reasoning module that deploys the binary coding to pair (learning to\nalign) task-relevant positive shots with irrelevant negative shots. It embeds\nthe optimised shot selections into the original video, facilitating a focus on\nrelevant context to optimize long video understanding. Experiments across three\nbaselines and five datasets demonstrate the effectiveness and adaptability of\nCoS. 
Code given in https://lwpyh.github.io/CoS.","upvotes":10,"discussionId":"67ac990b9e12456bdb1d2efe","projectPage":"https://lwpyh.github.io/CoS/","githubRepo":"https://github.com/lwpyh/CoS_codes","githubRepoAddedBy":"auto","ai_summary":"Chain-of-Shot prompting optimizes shot selection for long video understanding by aligning relevant shots with the video's semantic task, enhancing model performance without redundancy.","ai_keywords":["multi-modal large language models","MLLMs","long videos","visual tokens","context length","sparse sampling","exhaustive sampling","shot selection","test-time visual prompt optimisation","shots-task alignment","binary video summary","pseudo temporal grounding","video co-reasoning module","task-relevant shots","irrelevant shots","long video understanding"],"githubStars":53},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"65e1b6e9501590df0173cbd3","avatarUrl":"/avatars/a73e2139700e23eff455734c99cef5ba.svg","isPro":false,"fullname":"Jian Hu","user":"lwpyh","type":"user"},{"_id":"667ee096b0fad0fdee319ed4","avatarUrl":"/avatars/d9df687e8522d47f7fcefe40fd9b575b.svg","isPro":false,"fullname":"Zixu Cheng","user":"Cade921","type":"user"},{"_id":"650c8bfb3d3542884da1a845","avatarUrl":"/avatars/863a5deebf2ac6d4faedc4dd368e0561.svg","isPro":false,"fullname":"Adhurim ","user":"Limi07","type":"user"},{"_id":"64d98ef7a4839890b25eb78b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64d98ef7a4839890b25eb78b/215-CSVLl81z6CAq0ECWU.jpeg","isPro":true,"fullname":"Fangyuan Yu","user":"Ksgk-fy","type":"user"},{"_id":"6270324ebecab9e2dcf245de","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6270324ebecab9e2dcf245de/cMbtWSasyNlYc9hvsEEzt.jpeg","isPro":false,"fullname":"Kye Gomez","user":"kye","type":"user"},{"_id":"631e14ac473a6825f285e89d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/631e14ac473a6825f285e89d/K-6QnoeGLg8XFvbTMMdqA.jpeg","isPro":false,"fullname":"Yury Panikov","user":"panikov","type":"user"},{"_id":"66f612b934b8ac9ffa44f084","avatarUrl":"/avatars/6836c122e19c66c90f1673f28b30d7f0.svg","isPro":false,"fullname":"Tang","user":"tommysally","type":"user"},{"_id":"64d4615cf8082bf19b916492","avatarUrl":"/avatars/8e1b59565ec5e4b31090cf1b911781b9.svg","isPro":false,"fullname":"wongyukim","user":"wongyukim","type":"user"},{"_id":"65a4567e212d6aca9a3e8f5a","avatarUrl":"/avatars/ed944797230b5460381209bf76e4a0e4.svg","isPro":false,"fullname":"Catherine Liu","user":"Liu12uiL","type":"user"},{"_id":"64bbe9b236eb058cd9d6a5b9","avatarUrl":"/avatars/c7c01a3fa8809e73800392679abff6d5.svg","isPro":false,"fullname":"Kai Zuberbühler","user":"kaizuberbuehler","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
arxiv:2502.06428

CoS: Chain-of-Shot Prompting for Long Video Understanding

Published on Feb 10, 2025
Submitted by Jian Hu on Feb 12, 2025

Authors: Jian Hu, Zixu Cheng, Chenyang Si, Wei Li, Shaogang Gong

AI-generated summary

Chain-of-Shot prompting optimizes shot selection for long video understanding by aligning relevant shots with the video's semantic task, enhancing model performance without redundancy.

Abstract

Multi-modal Large Language Models (MLLMs) struggle with long videos because they require excessive visual tokens. These tokens massively exceed the context length of MLLMs, so the context becomes filled with redundant, task-irrelevant shots. How to select shots is an unsolved critical problem: sparse sampling risks missing key details, while exhaustive sampling overwhelms the model with irrelevant content, leading to video misunderstanding. To solve this problem, we propose Chain-of-Shot prompting (CoS). The key idea is to frame shot selection as test-time visual prompt optimisation, choosing shots adapted to the semantic task of video understanding by optimising shot-task alignment. CoS has two key parts: (1) a binary video summary mechanism that performs pseudo temporal grounding, discovering a binary coding that identifies task-relevant shots, and (2) a video co-reasoning module that deploys the binary coding to pair (learning to align) task-relevant positive shots with irrelevant negative shots. It embeds the optimised shot selections into the original video, focusing the model on relevant context to improve long video understanding. Experiments across three baselines and five datasets demonstrate the effectiveness and adaptability of CoS. Code is given at https://lwpyh.github.io/CoS.
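To make the two-stage pipeline concrete, here is a minimal, hypothetical Python sketch of the idea the abstract describes. Every name in it (the Shot type, the relevance scorer, the function names) is an illustrative assumption rather than the authors' API; the actual implementation is in the linked GitHub repository.

```python
# A hypothetical sketch of the CoS pipeline, based only on the abstract.
# None of these names come from the authors' code; see
# https://github.com/lwpyh/CoS_codes for the real implementation.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Shot:
    """A contiguous segment of the video (hypothetical representation)."""
    frames: List[object] = field(default_factory=list)


def binary_video_summary(
    shots: List[Shot],
    task: str,
    relevance: Callable[[Shot, str], float],
    threshold: float = 0.5,
) -> List[int]:
    """Pseudo temporal grounding: score each shot against the task and
    emit a binary code (1 = task-relevant, 0 = irrelevant)."""
    return [1 if relevance(shot, task) >= threshold else 0 for shot in shots]


def video_co_reasoning(
    shots: List[Shot], code: List[int]
) -> Tuple[List[Shot], List[Shot]]:
    """Use the binary code to pair task-relevant positive shots with
    irrelevant negative shots, so the model can contrast the two."""
    positives = [s for s, c in zip(shots, code) if c == 1]
    negatives = [s for s, c in zip(shots, code) if c == 0]
    return positives, negatives


def chain_of_shot_prompt(
    shots: List[Shot],
    task: str,
    relevance: Callable[[Shot, str], float],
) -> List[Shot]:
    """Test-time visual prompt optimisation: select the shots to embed
    back into the input so the MLLM focuses on task-relevant context."""
    code = binary_video_summary(shots, task, relevance)
    positives, _negatives = video_co_reasoning(shots, code)
    # Fall back to the full video if no shot scores as relevant.
    return positives if positives else shots
```

In this sketch the relevance scorer stands in for the paper's binary video summary model; the point is only the control flow: score shots at test time, split them into positive and negative sets, and feed the positives back to the MLLM as the visual prompt.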

Community

Paper author Paper submitter

Project page: https://lwpyh.github.io/CoS/
Code: https://github.com/lwpyh/CoS_codes

Librarian Bot (Bot)

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* [InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling](https://huggingface.co/papers/2501.12386) (2025)
* [VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling](https://huggingface.co/papers/2501.00574) (2024)
* [ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding](https://huggingface.co/papers/2412.20504) (2024)
* [The Devil is in Temporal Token: High Quality Video Reasoning Segmentation](https://huggingface.co/papers/2501.08549) (2025)
* [Temporal Preference Optimization for Long-Form Video Understanding](https://huggingface.co/papers/2501.13919) (2025)
* [VidCtx: Context-aware Video Question Answering with Image Models](https://huggingface.co/papers/2412.17415) (2024)
* [VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos](https://huggingface.co/papers/2502.01549) (2025)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2502.06428 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2502.06428 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2502.06428 in a Space README.md to link it from this page.

Collections including this paper 4