NarraScore: Bridging Visual Narrative and Musical Dynamics via Hierarchical Affective Control
Abstract
Synthesizing coherent soundtracks for long-form videos remains a formidable challenge, stalled by three impediments: computational scalability, temporal coherence, and, most critically, a pervasive semantic blindness to evolving narrative logic. To bridge these gaps, we propose NarraScore, a hierarchical framework built on the core insight that emotion serves as a high-density compression of narrative logic. Uniquely, we repurpose frozen Vision-Language Models (VLMs) as continuous affective sensors, distilling high-dimensional visual streams into dense, narrative-aware Valence-Arousal trajectories. Mechanistically, NarraScore employs a Dual-Branch Injection strategy to reconcile global structure with local dynamism: a Global Semantic Anchor ensures stylistic stability, while a lightweight Token-Level Affective Adapter modulates local tension through direct element-wise residual injection. This minimalist design bypasses the bottlenecks of dense attention and architectural cloning, mitigating the overfitting risks that accompany data scarcity. Experiments demonstrate that NarraScore achieves state-of-the-art consistency and narrative alignment with negligible computational overhead, establishing a fully autonomous paradigm for long-video soundtrack generation.
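The abstract describes the Dual-Branch Injection only at a high level, so below is a minimal PyTorch sketch of how such a mechanism could be wired up: a global anchor vector is broadcast and added to every generator token, while a small MLP maps the per-token Valence-Arousal trajectory to an element-wise residual. All names and shapes here (`DualBranchInjection`, `TokenAffectiveAdapter`, `hidden_dim`, the 2-dimensional V-A input) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class TokenAffectiveAdapter(nn.Module):
    """Maps a per-token Valence-Arousal pair to a residual shift that is
    added element-wise to the generator's token states (assumed design)."""

    def __init__(self, hidden_dim: int, va_dim: int = 2):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(va_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, tokens: torch.Tensor, va_traj: torch.Tensor) -> torch.Tensor:
        # tokens:  (batch, seq_len, hidden_dim) frozen-generator token states
        # va_traj: (batch, seq_len, 2) V-A trajectory from the VLM sensor,
        #          resampled to the generator's token rate
        return tokens + self.proj(va_traj)  # element-wise residual injection


class DualBranchInjection(nn.Module):
    """Global Semantic Anchor (one vector per video) plus the token-level
    affective residual, both added to the same token stream."""

    def __init__(self, hidden_dim: int, anchor_dim: int):
        super().__init__()
        self.anchor_proj = nn.Linear(anchor_dim, hidden_dim)
        self.adapter = TokenAffectiveAdapter(hidden_dim)

    def forward(self, tokens: torch.Tensor, anchor: torch.Tensor,
                va_traj: torch.Tensor) -> torch.Tensor:
        # Broadcast the global anchor over time for stylistic stability,
        # then let the adapter modulate local tension token by token.
        tokens = tokens + self.anchor_proj(anchor).unsqueeze(1)
        return self.adapter(tokens, va_traj)


# Toy usage with random tensors standing in for real features.
inj = DualBranchInjection(hidden_dim=1024, anchor_dim=768)
tokens = torch.randn(1, 256, 1024)   # music-generator token states
anchor = torch.randn(1, 768)         # global semantic embedding of the video
va = torch.rand(1, 256, 2) * 2 - 1   # Valence-Arousal trajectory in [-1, 1]
out = inj(tokens, anchor, va)        # (1, 256, 1024)
```

Because both branches are additive residuals on a frozen backbone, only the two small projections would be trainable, which is consistent with the paper's claim of negligible computational overhead and reduced overfitting risk.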
Community
This is an automated message from Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- A$^2$-LLM: An End-to-end Conversational Audio Avatar Large Language Model (2026)
- Multi Agents Semantic Emotion Aligned Music to Image Generation with Music Derived Captions (2025)
- SemanticAudio: Audio Generation and Editing in Semantic Space (2026)
- Emotion-LLaMAv2 and MMEVerse: A New Framework and Benchmark for Multimodal Emotion Understanding (2026)
- AUHead: Realistic Emotional Talking Head Generation via Action Units Control (2026)
- ActAvatar: Temporally-Aware Precise Action Control for Talking Avatars (2025)
- 3DXTalker: Unifying Identity, Lip Sync, Emotion, and Spatial Dynamics in Expressive 3D Talking Avatars (2026)
