ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis
Abstract
A pose- and viewpoint-controllable human video generation method combines image generation with SMPL-X motion guidance and video diffusion models to produce high-quality, temporally consistent videos.
Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-view data. Existing methods often address these factors separately, resulting in limited controllability or reduced visual quality. We revisit this problem from an image-first perspective, where high-quality human appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal consistency. We propose a pose- and viewpoint-controllable pipeline that combines a pretrained image backbone with SMPL-X-based motion guidance, together with a training-free temporal refinement stage based on a pretrained video diffusion model. Our method produces high-quality, temporally consistent videos under diverse poses and viewpoints. We also release a canonical human dataset and an auxiliary model for compositional human image synthesis. Code and data are publicly available at https://github.com/Taited/ReImagine.
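To make the two-stage pipeline concrete, below is a minimal structural sketch in PyTorch. It is a sketch under stated assumptions, not the authors' implementation: every name (`render_smplx_normals`, `ImageStage`, `generate`, `refine_temporal`) is a hypothetical stand-in, and the real image backbone is a conditional diffusion model rather than the toy blend used here to keep the code self-contained and runnable.

```python
# Hypothetical sketch of the two-stage, image-first pipeline; names and
# shapes are illustrative assumptions, not the released ReImagine API.
import torch

def render_smplx_normals(pose, camera):
    """Placeholder for rasterizing an SMPL-X mesh into a normal map
    that encodes both the target body pose and the camera viewpoint."""
    return torch.rand(3, 512, 512)  # C x H x W normal map in [0, 1]

class ImageStage(torch.nn.Module):
    """Stand-in for the pretrained image backbone that synthesizes one
    high-quality frame from reference images plus a normal-map control."""
    def forward(self, front_ref, back_ref, normal_map):
        # A real backbone runs conditional diffusion; averaging the
        # inputs here only keeps the sketch runnable end to end.
        return (front_ref + back_ref + normal_map) / 3.0

def generate(front_ref, back_ref, poses, cameras, refine_temporal):
    stage1 = ImageStage()
    # Stage 1: image-first, per-frame synthesis under pose/view control.
    frames = torch.stack([
        stage1(front_ref, back_ref, render_smplx_normals(p, c))
        for p, c in zip(poses, cameras)
    ])
    # Stage 2: training-free temporal refinement with a pretrained
    # video diffusion model (see the refinement sketch further below).
    return refine_temporal(frames)

# Toy call: 8 frames, identity refinement as a placeholder.
video = generate(torch.rand(3, 512, 512), torch.rand(3, 512, 512),
                 poses=range(8), cameras=range(8),
                 refine_temporal=lambda f: f)
```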
Community
This work introduces a method for human video generation with two main components, plus a bonus model. First, we generate high-quality images from front and back reference views of the subject, controlling both viewpoint and pose with SMPL-X-rendered normal maps. Second, a training-free temporal refinement module turns these per-frame results into continuous frames, ensuring smooth, temporally consistent video synthesis. As a bonus, we also release an image model that accepts disentangled inputs for the face, clothing, and shoes, together with SMPL-X-rendered normal maps, for compositional human image synthesis.
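The post does not say how the training-free refinement works internally. One common training-free recipe for this kind of step is SDEdit-style partial noising followed by denoising with a pretrained video diffusion model, whose temporal attention couples the frames into a consistent clip. The sketch below shows that recipe purely as an assumption: `video_denoiser`, `alphas_cumprod`, and `start_step` are all hypothetical, and the paper may use a different mechanism.

```python
# Assumed SDEdit-style temporal refinement, not confirmed by the paper:
# partially noise the per-frame results, then denoise the whole clip
# with a pretrained video diffusion model so its temporal layers can
# enforce frame-to-frame consistency.
import torch

def refine_temporal(frames, video_denoiser, alphas_cumprod, start_step=400):
    """frames: (T, C, H, W) stack produced by the image stage."""
    a = alphas_cumprod[start_step]
    # Jump to an intermediate noise level instead of pure noise, so the
    # per-frame appearance from stage 1 is largely preserved.
    x = a.sqrt() * frames + (1 - a).sqrt() * torch.randn_like(frames)
    # Denoise from start_step down to 1 with the video model; each call
    # is a hypothetical one-step denoise of the full clip.
    for t in range(start_step, 0, -1):
        x = video_denoiser(x, t)
    return x

# Toy usage with an identity "denoiser" (a real one would be a video
# diffusion model); this only demonstrates shapes and control flow.
alphas_cumprod = torch.linspace(0.9999, 0.0001, 1000)
frames = torch.rand(16, 3, 64, 64)
clip = refine_temporal(frames, lambda x, t: x, alphas_cumprod)
```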
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision (2026)
- Novel View Synthesis as Video Completion (2026)
- Tri-Prompting: Video Diffusion with Unified Control over Scene, Subject, and Motion (2026)
- 3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model (2026)
- Controllable Complex Human Motion Video Generation via Text-to-Skeleton Cascades (2026)
- WildActor: Unconstrained Identity-Preserving Video Generation (2026)
- Human Video Generation from a Single Image with 3D Pose and View Control (2026)