\n","updatedAt":"2026-02-05T08:05:33.996Z","author":{"_id":"62bec18e7e808565cc15610f","avatarUrl":"/avatars/78e8b98120a61bf90c43bd8c8ea8d375.svg","fullname":"Zhenpeng Huang","name":"hzp","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":1,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.5873067378997803},"editors":["hzp"],"editorAvatarUrls":["/avatars/78e8b98120a61bf90c43bd8c8ea8d375.svg"],"reactions":[],"isReport":false}},{"id":"6985471239487820aac210ce","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2026-02-06T01:42:42.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning](https://huggingface.co/papers/2601.15724) (2026)\n* [VideoWeave: A Data-Centric Approach for Efficient Video Understanding](https://huggingface.co/papers/2601.06309) (2026)\n* [Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning](https://huggingface.co/papers/2601.23224) (2026)\n* [Video Evidence to Reasoning Efficient Video Understanding via Explicit Evidence Grounding](https://huggingface.co/papers/2601.07761) (2026)\n* [A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos](https://huggingface.co/papers/2512.16978) (2025)\n* [CounterVid: Counterfactual Video Generation for Mitigating Action and Temporal Hallucinations in Video-Language Models](https://huggingface.co/papers/2601.04778) (2026)\n* [What Happens Next? Next Scene Prediction with a Unified Video Model](https://huggingface.co/papers/2512.13015) (2025)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

\n

The following papers were recommended by the Semantic Scholar API

\n\n

Please give a thumbs up to this comment if you found it helpful!

\n

If you want recommendations for any Paper on Hugging Face checkout this Space

\n

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend

\n","updatedAt":"2026-02-06T01:42:42.419Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6997690796852112},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}},{"id":"69874f5ae143f25e92749157","author":{"_id":"65243980050781c16f234f1f","avatarUrl":"/avatars/743a009681d5d554c27e04300db9f267.svg","fullname":"Avi","name":"avahal","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":3,"isUserFollowing":false},"createdAt":"2026-02-07T14:42:34.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"arXivLens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/longvpo-from-anchored-cues-to-self-reasoning-for-long-form-video-preference-optimization-5000-41c172e7\n- Executive Summary\n- Detailed Breakdown\n- Practical Applications","html":"

arXivLens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/longvpo-from-anchored-cues-to-self-reasoning-for-long-form-video-preference-optimization-5000-41c172e7

\n
    \n
  • Executive Summary
  • \n
  • Detailed Breakdown
  • \n
  • Practical Applications
  • \n
\n","updatedAt":"2026-02-07T14:42:34.112Z","author":{"_id":"65243980050781c16f234f1f","avatarUrl":"/avatars/743a009681d5d554c27e04300db9f267.svg","fullname":"Avi","name":"avahal","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":3,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6933493614196777},"editors":["avahal"],"editorAvatarUrls":["/avatars/743a009681d5d554c27e04300db9f267.svg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2602.02341","authors":[{"_id":"69834c2437bca54cd0587d46","user":{"_id":"62bec18e7e808565cc15610f","avatarUrl":"/avatars/78e8b98120a61bf90c43bd8c8ea8d375.svg","isPro":false,"fullname":"Zhenpeng Huang","user":"hzp","type":"user"},"name":"Zhenpeng Huang","status":"claimed_verified","statusLastChangedAt":"2026-02-05T10:55:14.003Z","hidden":false},{"_id":"69834c2437bca54cd0587d47","name":"Jiaqi Li","hidden":false},{"_id":"69834c2437bca54cd0587d48","name":"Zihan Jia","hidden":false},{"_id":"69834c2437bca54cd0587d49","name":"Xinhao Li","hidden":false},{"_id":"69834c2437bca54cd0587d4a","name":"Desen Meng","hidden":false},{"_id":"69834c2437bca54cd0587d4b","name":"Lingxue Song","hidden":false},{"_id":"69834c2437bca54cd0587d4c","name":"Xi Chen","hidden":false},{"_id":"69834c2437bca54cd0587d4d","name":"Liang Li","hidden":false},{"_id":"69834c2437bca54cd0587d4e","name":"Limin Wang","hidden":false}],"publishedAt":"2026-02-02T17:03:37.000Z","submittedOnDailyAt":"2026-02-05T05:35:33.988Z","title":"LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization","submittedOnDailyBy":{"_id":"62bec18e7e808565cc15610f","avatarUrl":"/avatars/78e8b98120a61bf90c43bd8c8ea8d375.svg","isPro":false,"fullname":"Zhenpeng Huang","user":"hzp","type":"user"},"summary":"We present LongVPO, a novel two-stage Direct Preference Optimization framework that enables short-context vision-language models to robustly understand ultra-long videos without any long-video annotations. In Stage 1, we synthesize preference triples by anchoring questions to individual short clips, interleaving them with distractors, and applying visual-similarity and question-specificity filtering to mitigate positional bias and ensure unambiguous supervision. We also approximate the reference model's scoring over long contexts by evaluating only the anchor clip, reducing computational overhead. In Stage 2, we employ a recursive captioning pipeline on long videos to generate scene-level metadata, then use a large language model to craft multi-segment reasoning queries and dispreferred responses, aligning the model's preferences through multi-segment reasoning tasks. 
With only 16K synthetic examples and no costly human labels, LongVPO outperforms the state-of-the-art open-source models on multiple long-video benchmarks, while maintaining strong short-video performance (e.g., on MVBench), offering a scalable paradigm for efficient long-form video understanding.","upvotes":1,"discussionId":"69834c2537bca54cd0587d4f","ai_summary":"LongVPO is a two-stage Direct Preference Optimization framework that enables short-context vision-language models to understand ultra-long videos through synthetic preference triples and recursive captioning, achieving state-of-the-art performance with minimal human annotation.","ai_keywords":["Direct Preference Optimization","vision-language models","ultra-long videos","preference triples","visual-similarity filtering","question-specificity filtering","positional bias","recursive captioning","multi-segment reasoning","large language model","scene-level metadata"]},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"62bec18e7e808565cc15610f","avatarUrl":"/avatars/78e8b98120a61bf90c43bd8c8ea8d375.svg","isPro":false,"fullname":"Zhenpeng Huang","user":"hzp","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
Papers
arxiv:2602.02341

LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization

Published on Feb 2 · Submitted by Zhenpeng Huang on Feb 5
Authors: Zhenpeng Huang, Jiaqi Li, Zihan Jia, Xinhao Li, Desen Meng, Lingxue Song, Xi Chen, Liang Li, Limin Wang

Abstract

AI-generated summary

LongVPO is a two-stage Direct Preference Optimization framework that enables short-context vision-language models to understand ultra-long videos through synthetic preference triples and recursive captioning, achieving state-of-the-art performance with minimal human annotation.

We present LongVPO, a novel two-stage Direct Preference Optimization framework that enables short-context vision-language models to robustly understand ultra-long videos without any long-video annotations. In Stage 1, we synthesize preference triples by anchoring questions to individual short clips, interleaving them with distractors, and applying visual-similarity and question-specificity filtering to mitigate positional bias and ensure unambiguous supervision. We also approximate the reference model's scoring over long contexts by evaluating only the anchor clip, reducing computational overhead. In Stage 2, we employ a recursive captioning pipeline on long videos to generate scene-level metadata, then use a large language model to craft multi-segment reasoning queries and dispreferred responses, aligning the model's preferences through multi-segment reasoning tasks. With only 16K synthetic examples and no costly human labels, LongVPO outperforms the state-of-the-art open-source models on multiple long-video benchmarks, while maintaining strong short-video performance (e.g., on MVBench), offering a scalable paradigm for efficient long-form video understanding.
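
As a concrete illustration of the Stage 1 recipe described above, here is a minimal sketch of how a preference triple might be assembled: anchor a clip-specific question, filter out visually similar distractors, and interleave the anchor at a random position. All names here (Clip, build_preference_triple, the 0.85 similarity threshold, the distractor_answer field) are illustrative assumptions, not the authors' released code.

```python
import random
from dataclasses import dataclass

@dataclass
class Clip:
    clip_id: str
    embedding: list  # pooled visual feature; the exact representation is an assumption

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / max(den, 1e-8)

def build_preference_triple(anchor, qa, distractor_pool,
                            num_distractors=8, sim_threshold=0.85):
    """Interleave the anchor clip with distractors so that the anchored question
    stays answerable only from the anchor clip."""
    # Visual-similarity filter: drop distractors that look too much like the anchor,
    # which would make the preferred answer ambiguous.
    candidates = [c for c in distractor_pool
                  if cosine(c.embedding, anchor.embedding) < sim_threshold]
    distractors = random.sample(candidates, k=min(num_distractors, len(candidates)))

    # A question-specificity filter (not shown) would additionally drop questions
    # that could be answered without looking at the anchor clip.

    # Insert the anchor at a random position to mitigate positional bias.
    sequence = list(distractors)
    insert_at = random.randrange(len(sequence) + 1)
    sequence.insert(insert_at, anchor)

    return {
        "video": [c.clip_id for c in sequence],   # synthetic "long" video
        "question": qa["question"],               # question anchored to the anchor clip
        "chosen": qa["answer"],                   # grounded (preferred) response
        "rejected": qa["distractor_answer"],      # dispreferred response (assumed field)
        "anchor_index": insert_at,                # lets the reference model score the anchor alone
    }
```

The anchor_index field is kept only so that training can cheaply re-score the anchor clip with the reference model, as sketched next.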

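The abstract also notes that the reference model's scoring over the long context is approximated by evaluating only the anchor clip. Below is a minimal, hedged sketch of a standard DPO loss in PyTorch; under that approximation, the reference log-probabilities fed into it would come from the reference model run on the anchor clip alone rather than on the full interleaved video. The helper names and toy values are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Standard DPO objective: -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))).
    In the anchor-only approximation, ref_chosen_lp / ref_rejected_lp are computed by
    the reference model on the anchor clip alone, avoiding a full long-context pass."""
    margins = (policy_chosen_lp - ref_chosen_lp) - (policy_rejected_lp - ref_rejected_lp)
    return -F.logsigmoid(beta * margins).mean()

if __name__ == "__main__":
    # Toy batch of 4 examples with made-up summed token log-probabilities.
    b = 4
    loss = dpo_loss(
        policy_chosen_lp=torch.randn(b),    # policy scored on the full (anchor + distractors) context
        policy_rejected_lp=torch.randn(b),
        ref_chosen_lp=torch.randn(b),       # reference scored on the anchor clip only (per the abstract)
        ref_rejected_lp=torch.randn(b),
        beta=0.1,
    )
    print("dpo loss:", float(loss))
```
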
Community

Paper author Paper submitter

https://github.com/MCG-NJU/LongVPO

Librarian Bot

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

  • VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning (2026) - https://huggingface.co/papers/2601.15724
  • VideoWeave: A Data-Centric Approach for Efficient Video Understanding (2026) - https://huggingface.co/papers/2601.06309
  • Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning (2026) - https://huggingface.co/papers/2601.23224
  • Video Evidence to Reasoning Efficient Video Understanding via Explicit Evidence Grounding (2026) - https://huggingface.co/papers/2601.07761
  • A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos (2025) - https://huggingface.co/papers/2512.16978
  • CounterVid: Counterfactual Video Generation for Mitigating Action and Temporal Hallucinations in Video-Language Models (2026) - https://huggingface.co/papers/2601.04778
  • What Happens Next? Next Scene Prediction with a Unified Video Model (2025) - https://huggingface.co/papers/2512.13015

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

avahal

arXivLens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/longvpo-from-anchored-cues-to-self-reasoning-for-long-form-video-preference-optimization-5000-41c172e7

  • Executive Summary
  • Detailed Breakdown
  • Practical Applications


Models citing this paper 2

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2602.02341 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2602.02341 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.