arxiv:2602.09070

NarraScore: Bridging Visual Narrative and Musical Dynamics via Hierarchical Affective Control

Published on Feb 9 · Submitted by Zhaocheng Liu on Feb 13
Authors: Yufan Wen, Zhaocheng Liu, YeGuo Hua, Ziyi Guo, Lihua Zhang, Chun Yuan, Jian Wu (ByteDance)

Abstract

AI-generated summary: NarraScore presents a hierarchical framework that uses frozen Vision-Language Models as affective sensors to generate coherent soundtracks for long-form videos by combining global semantic anchors with token-level adaptive modulation.

Synthesizing coherent soundtracks for long-form videos remains a formidable challenge, currently stalled by three critical impediments: computational scalability, temporal coherence, and, most critically, a pervasive semantic blindness to evolving narrative logic. To bridge these gaps, we propose NarraScore, a hierarchical framework predicated on the core insight that emotion serves as a high-density compression of narrative logic. Uniquely, we repurpose frozen Vision-Language Models (VLMs) as continuous affective sensors, distilling high-dimensional visual streams into dense, narrative-aware Valence-Arousal trajectories. Mechanistically, NarraScore employs a Dual-Branch Injection strategy to reconcile global structure with local dynamism: a Global Semantic Anchor ensures stylistic stability, while a surgical Token-Level Affective Adapter modulates local tension via direct element-wise residual injection. This minimalist design bypasses the bottlenecks of dense attention and architectural cloning, effectively mitigating the overfitting risks associated with data scarcity. Experiments demonstrate that NarraScore achieves state-of-the-art consistency and narrative alignment with negligible computational overhead, establishing a fully autonomous paradigm for long-video soundtrack generation.
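To make the pipeline concrete, here is a minimal sketch of the first stage described above: sampling frames from a long video and distilling them into a dense Valence-Arousal trajectory. This is an assumption-based illustration, not the authors' code; in particular, `score_affect` is a hypothetical placeholder for a call to whatever frozen VLM serves as the affective sensor, and the sampling stride is arbitrary.

```python
from dataclasses import dataclass

@dataclass
class VAPoint:
    t: float        # timestamp in seconds
    valence: float  # pleasantness, e.g. in [-1, 1]
    arousal: float  # intensity, e.g. in [-1, 1]

def score_affect(frame) -> tuple[float, float]:
    """Hypothetical stand-in for a frozen-VLM call that rates one frame."""
    raise NotImplementedError("replace with a prompt to your VLM of choice")

def va_trajectory(frames, fps: float, stride: int = 8) -> list[VAPoint]:
    """Distill a frame stream into a dense, narrative-aware VA trajectory."""
    traj = []
    for i in range(0, len(frames), stride):
        v, a = score_affect(frames[i])  # frozen VLM as continuous affective sensor
        traj.append(VAPoint(t=i / fps, valence=v, arousal=a))
    return traj
```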

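And a minimal PyTorch sketch of the Dual-Branch Injection idea, again only as the abstract describes it: a Global Semantic Anchor conditions every token for stylistic stability, while a Token-Level Affective Adapter adds a direct element-wise residual driven by the per-token VA signal. All layer shapes, module internals, and names below are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TokenLevelAffectiveAdapter(nn.Module):
    """Maps per-token (valence, arousal) values to an element-wise residual."""
    def __init__(self, d_model: int, d_affect: int = 2):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_affect, d_model),
            nn.SiLU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, tokens: torch.Tensor, va: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T, d_model) music-token hidden states
        # va:     (B, T, 2) VA trajectory resampled to the token timeline
        return tokens + self.proj(va)  # direct element-wise residual injection

class DualBranchInjection(nn.Module):
    """Global anchor for stylistic stability plus a local affective adapter."""
    def __init__(self, d_model: int, d_anchor: int):
        super().__init__()
        self.anchor_proj = nn.Linear(d_anchor, d_model)
        self.adapter = TokenLevelAffectiveAdapter(d_model)

    def forward(self, tokens: torch.Tensor, anchor: torch.Tensor, va: torch.Tensor) -> torch.Tensor:
        # anchor: (B, d_anchor) one embedding summarizing the whole video
        tokens = tokens + self.anchor_proj(anchor).unsqueeze(1)  # broadcast over time
        return self.adapter(tokens, va)
```

Note how this sketch adds no attention layers and clones no backbone modules; both branches are cheap projections on a residual path, which is consistent with the abstract's claim of negligible computational overhead.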
Community

Paper author · Paper submitter

[Screenshot 2026-02-13 13.53.13]

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* A$^2$-LLM: An End-to-end Conversational Audio Avatar Large Language Model (huggingface.co/papers/2602.04913, 2026)
* Multi Agents Semantic Emotion Aligned Music to Image Generation with Music Derived Captions (huggingface.co/papers/2512.23320, 2025)
* SemanticAudio: Audio Generation and Editing in Semantic Space (huggingface.co/papers/2601.21402, 2026)
* Emotion-LLaMAv2 and MMEVerse: A New Framework and Benchmark for Multimodal Emotion Understanding (huggingface.co/papers/2601.16449, 2026)
* AUHead: Realistic Emotional Talking Head Generation via Action Units Control (huggingface.co/papers/2602.09534, 2026)
* ActAvatar: Temporally-Aware Precise Action Control for Talking Avatars (huggingface.co/papers/2512.19546, 2025)
* 3DXTalker: Unifying Identity, Lip Sync, Emotion, and Spatial Dynamics in Expressive 3D Talking Avatars (huggingface.co/papers/2602.10516, 2026)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper: 0
Datasets citing this paper: 0
Spaces citing this paper: 0
Collections including this paper: 0

Cite arxiv.org/abs/2602.09070 in a model, dataset, or Space README.md, or add this paper to a collection, to link it from this page.