StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration
Abstract
StreamChar enables real-time streaming audio-video generation for character animation by separating long-horizon orchestration from short-window denoising through an LLM-based orchestrator and joint audio-video DiT, achieving efficient deployment via two-stage distillation and maintaining visual consistency through memory mechanisms.
Real-time streaming joint audio-video generation for character animation requires a generator to speak the requested transcript, maintain visual identity across chunks, and run within a strict playback budget. These requirements are difficult to satisfy simultaneously: chunk-wise autoregressive generation can accumulate transcript-audio misalignment and visual drift, while the few-step distillation needed for low latency often degrades spatial diversity and temporal quality. We present StreamChar, a streaming framework that separates long-horizon orchestration from short-window audio-video denoising. An LLM-based orchestrator uses the transcript and historical context to produce frame-aligned audio conditions, and a joint audio-video DiT performs local bidirectional denoising with reference and motion-frame conditioning. For efficient deployment, we use a two-stage distillation pipeline that first compresses the sampler and then fine-tunes the student under online chunk rollouts. A progress-aware pointer aligns partial transcripts with generated audio during rollout training, and a sink-chunk memory provides a persistent visual anchor for reducing long-horizon drift. Experiments on short-clip and long-horizon protocols show that StreamChar runs in real time on a single H100 GPU and provides a favorable system-level trade-off among transcript fidelity, audio-visual synchronization, visual quality, and streaming stability compared with recent joint and audio-driven baselines.
Community
Audio-video multimodal generation
with real-time interactive experience
on a single GPU.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation (2026)
- Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation (2026)
- Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling (2026)
- Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation (2026)
- CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives (2026)
- StreamingEffect: Real-Time Human-Centric Video Effect Generation (2026)
- Native Audio-Visual Alignment for Generation (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Get this paper in your agent:
hf papers read 2605.25659 Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper