ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis
Abstract
A pose- and viewpoint-controllable human video generation method combines image generation with SMPL-X motion guidance and video diffusion models to produce high-quality, temporally consistent videos.
Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-view data. Existing methods often address these factors separately, resulting in limited controllability or reduced visual quality. We revisit this problem from an image-first perspective, where high-quality human appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal consistency. We propose a pose- and viewpoint-controllable pipeline that combines a pretrained image backbone with SMPL-X-based motion guidance, together with a training-free temporal refinement stage based on a pretrained video diffusion model. Our method produces high-quality, temporally consistent videos under diverse poses and viewpoints. We also release a canonical human dataset and an auxiliary model for compositional human image synthesis. Code and data are publicly available at https://github.com/Taited/ReImagine.
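To make the two-stage pipeline concrete, below is a minimal structural sketch in PyTorch. It is a sketch under stated assumptions, not the authors' implementation: every name (`render_smplx_normals`, `ImageStage`, `generate`, `refine_temporal`) is a hypothetical stand-in, and the real image backbone is a conditional diffusion model rather than the toy blend used here to keep the code self-contained and runnable.

```python
# Hypothetical sketch of the two-stage, image-first pipeline; names and
# shapes are illustrative assumptions, not the released ReImagine API.
import torch

def render_smplx_normals(pose, camera):
    """Placeholder for rasterizing an SMPL-X mesh into a normal map
    that encodes both the target body pose and the camera viewpoint."""
    return torch.rand(3, 512, 512)  # C x H x W normal map in [0, 1]

class ImageStage(torch.nn.Module):
    """Stand-in for the pretrained image backbone that synthesizes one
    high-quality frame from reference images plus a normal-map control."""
    def forward(self, front_ref, back_ref, normal_map):
        # A real backbone runs conditional diffusion; averaging the
        # inputs here only keeps the sketch runnable end to end.
        return (front_ref + back_ref + normal_map) / 3.0

def generate(front_ref, back_ref, poses, cameras, refine_temporal):
    stage1 = ImageStage()
    # Stage 1: image-first, per-frame synthesis under pose/view control.
    frames = torch.stack([
        stage1(front_ref, back_ref, render_smplx_normals(p, c))
        for p, c in zip(poses, cameras)
    ])
    # Stage 2: training-free temporal refinement with a pretrained
    # video diffusion model (see the refinement sketch further below).
    return refine_temporal(frames)

# Toy call: 8 frames, identity refinement as a placeholder.
video = generate(torch.rand(3, 512, 512), torch.rand(3, 512, 512),
                 poses=range(8), cameras=range(8),
                 refine_temporal=lambda f: f)
```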
Community
This work introduces a method for human video generation with two main components, plus a bonus model. First, we generate high-quality images from front and back reference views of the subject, controlling both viewpoint and pose with SMPL-X-rendered normal maps. Second, a training-free temporal refinement module turns these per-frame results into continuous frames, ensuring smooth, temporally consistent video synthesis. As a bonus, we also release an image model that accepts disentangled inputs for the face, clothing, and shoes, together with SMPL-X-rendered normal maps, for compositional human image synthesis.
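The post does not say how the training-free refinement works internally. One common training-free recipe for this kind of step is SDEdit-style partial noising followed by denoising with a pretrained video diffusion model, whose temporal attention couples the frames into a consistent clip. The sketch below shows that recipe purely as an assumption: `video_denoiser`, `alphas_cumprod`, and `start_step` are all hypothetical, and the paper may use a different mechanism.

```python
# Assumed SDEdit-style temporal refinement, not confirmed by the paper:
# partially noise the per-frame results, then denoise the whole clip
# with a pretrained video diffusion model so its temporal layers can
# enforce frame-to-frame consistency.
import torch

def refine_temporal(frames, video_denoiser, alphas_cumprod, start_step=400):
    """frames: (T, C, H, W) stack produced by the image stage."""
    a = alphas_cumprod[start_step]
    # Jump to an intermediate noise level instead of pure noise, so the
    # per-frame appearance from stage 1 is largely preserved.
    x = a.sqrt() * frames + (1 - a).sqrt() * torch.randn_like(frames)
    # Denoise from start_step down to 1 with the video model; each call
    # is a hypothetical one-step denoise of the full clip.
    for t in range(start_step, 0, -1):
        x = video_denoiser(x, t)
    return x

# Toy usage with an identity "denoiser" (a real one would be a video
# diffusion model); this only demonstrates shapes and control flow.
alphas_cumprod = torch.linspace(0.9999, 0.0001, 1000)
frames = torch.rand(16, 3, 64, 64)
clip = refine_temporal(frames, lambda x, t: x, alphas_cumprod)
```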
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision (2026)
- Novel View Synthesis as Video Completion (2026)
- Tri-Prompting: Video Diffusion with Unified Control over Scene, Subject, and Motion (2026)
- 3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model (2026)
- Controllable Complex Human Motion Video Generation via Text-to-Skeleton Cascades (2026)
- WildActor: Unconstrained Identity-Preserving Video Generation (2026)
- Human Video Generation from a Single Image with 3D Pose and View Control (2026)