MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE
Ruijie Zhu1,2,
Jiahao Lu3,
Wenbo Hu2,
Xiaoguang Han4,
Jianfei Cai5,
Ying Shan2,
Chuanxia Zheng1
1 NTU 2 ARC Lab, Tencent PCG 3 HKUST 4 CUHK(SZ) 5 Monash University
Model Description
MotionCrafter is a video-diffusion-based framework that jointly reconstructs 4D geometry and estimates dense object motion from monocular videos. It predicts dense point maps and scene flow for each frame in a shared world coordinate system, with no post-hoc optimization required.
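To make the output representation concrete, here is a toy sketch (not the model's actual API) of what "dense point maps plus scene flow in a shared world frame" means: each frame carries a 3D point per pixel, and the flow is a per-point displacement to the next frame, so advecting a frame's points by its flow predicts their positions one step later. Shapes and names below are illustrative assumptions.

```python
import numpy as np

# Hypothetical clip: T frames of H x W pixels.
T, H, W = 3, 4, 5

rng = np.random.default_rng(0)
point_maps = rng.normal(size=(T, H, W, 3))  # one 3D point per pixel, world coords
scene_flow = rng.normal(size=(T, H, W, 3))  # per-point displacement to next frame

# Because both live in the same world coordinate system, warping is a plain add:
# frame t's points plus frame t's flow give their predicted positions at t+1.
warped = point_maps[:-1] + scene_flow[:-1]
print(warped.shape)  # (2, 4, 5, 3)
```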
Intended Use
- Research on 4D reconstruction and motion estimation from monocular videos
- Academic evaluation and benchmarking of dense point map and scene flow prediction
Not intended for safety-critical or real-time production use.
Limitations
- Performance can degrade with extreme motion blur or severe occlusion.
- Output quality is sensitive to input resolution and video quality.
- Generalization may be limited for out-of-domain scenes.
Training Data
Training data and preprocessing details are described in the paper and the main repository; for dataset specifics, please refer to the project page and the paper.
Evaluation
Please refer to the paper for evaluation datasets, metrics, and results.
How to Use
import torch

from motioncrafter import (
    MotionCrafterDiffPipeline,
    MotionCrafterDetermPipeline,
    UnifyAutoencoderKL,
    UNetSpatioTemporalConditionModelVid2vid,
)

unet_path = "TencentARC/MotionCrafter"
vae_path = "TencentARC/MotionCrafter"
model_type = "determ"  # or "diff" for the diffusion variant
cache_dir = "./pretrained_models"

# Load the UNet in fp16 for memory-efficient inference.
unet = UNetSpatioTemporalConditionModelVid2vid.from_pretrained(
    unet_path,
    subfolder='unet_diff' if model_type == 'diff' else 'unet_determ',
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    cache_dir=cache_dir,
).requires_grad_(False).to("cuda", dtype=torch.float16)

# Keep the 4D VAE in fp32 for numerically stable geometry/motion decoding.
geometry_motion_vae = UnifyAutoencoderKL.from_pretrained(
    vae_path,
    subfolder='geometry_motion_vae',
    low_cpu_mem_usage=True,
    torch_dtype=torch.float32,
    cache_dir=cache_dir,
).requires_grad_(False).to("cuda", dtype=torch.float32)

# Both pipelines build on the Stable Video Diffusion backbone with the custom UNet.
if model_type == 'diff':
    pipe = MotionCrafterDiffPipeline.from_pretrained(
        "stabilityai/stable-video-diffusion-img2vid-xt",
        unet=unet,
        torch_dtype=torch.float16,
        variant="fp16",
        cache_dir=cache_dir,
    ).to("cuda")
else:
    pipe = MotionCrafterDetermPipeline.from_pretrained(
        "stabilityai/stable-video-diffusion-img2vid-xt",
        unet=unet,
        torch_dtype=torch.float16,
        variant="fp16",
        cache_dir=cache_dir,
    ).to("cuda")
Model Weights
- geometry_motion_vae/: 4D VAE for joint geometry and motion representation
- unet_determ/: deterministic UNet for motion prediction
- unet_diff/: diffusion UNet for probabilistic motion prediction
Model Variants
- Deterministic (unet_determ): fast inference with fixed predictions per input
- Diffusion (unet_diff): probabilistic predictions with diverse outputs
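As a conceptual contrast (none of this is MotionCrafter's actual code), the two variants differ in that a deterministic predictor maps each input to one fixed output, while a diffusion predictor samples, so repeated runs can yield diverse plausible reconstructions:

```python
import random

def determ_predict(x):
    # Deterministic variant (toy stand-in): same input -> same output, always.
    return 2.0 * x

def diff_predict(x, rng):
    # Diffusion variant (toy stand-in): sampled noise -> diverse outputs.
    return 2.0 * x + rng.gauss(0.0, 0.1)

# The deterministic prediction is reproducible across calls.
assert determ_predict(1.0) == determ_predict(1.0)

# Repeated diffusion-style sampling produces multiple distinct predictions.
rng = random.Random(0)
samples = [diff_predict(1.0, rng) for _ in range(5)]
print(len(set(samples)))  # number of distinct sampled predictions
```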
Citation
@article{zhu2025motioncrafter,
  title={MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE},
  author={Zhu, Ruijie and Lu, Jiahao and Hu, Wenbo and Han, Xiaoguang and Cai, Jianfei and Shan, Ying and Zheng, Chuanxia},
  journal={arXiv preprint arXiv:2602.08961},
  year={2026}
}
License
This model is provided under the Tencent License. See LICENSE.txt for details.
Acknowledgments
This work builds upon GeometryCrafter. We thank the authors for their excellent contributions.