MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE
Ruijie Zhu1,2,
Jiahao Lu3,
Wenbo Hu2,
Xiaoguang Han4,
Jianfei Cai5,
Ying Shan2,
Chuanxia Zheng1
1 NTU 2 ARC Lab, Tencent PCG 3 HKUST 4 CUHK(SZ) 5 Monash University
Model Description
MotionCrafter is a video-diffusion-based framework that jointly reconstructs 4D geometry and estimates dense object motion from monocular videos. It predicts dense point maps and scene flow for each frame in a shared world coordinate system, with no post-hoc optimization required.
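To make the output representation concrete, here is a toy sketch (not the model's actual API) of what "dense point maps plus scene flow in a shared world frame" means: each frame carries a 3D point per pixel, and the flow is a per-point displacement to the next frame, so advecting a frame's points by its flow predicts their positions one step later. Shapes and names below are illustrative assumptions.

```python
import numpy as np

# Hypothetical clip: T frames of H x W pixels.
T, H, W = 3, 4, 5

rng = np.random.default_rng(0)
point_maps = rng.normal(size=(T, H, W, 3))  # one 3D point per pixel, world coords
scene_flow = rng.normal(size=(T, H, W, 3))  # per-point displacement to next frame

# Because both live in the same world coordinate system, warping is a plain add:
# frame t's points plus frame t's flow give their predicted positions at t+1.
warped = point_maps[:-1] + scene_flow[:-1]
print(warped.shape)  # (2, 4, 5, 3)
```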
Intended Use
- Research on 4D reconstruction and motion estimation from monocular videos
- Academic evaluation and benchmarking of dense point map and scene flow prediction
Not intended for safety-critical or real-time production use.
Limitations
- Performance can degrade with extreme motion blur or severe occlusion.
- Output quality is sensitive to input resolution and video quality.
- Generalization may be limited for out-of-domain scenes.
Training Data
Training data and preprocessing details are described in the paper and the main repository; for dataset specifics, please refer to the project page and the paper.
Evaluation
Please refer to the paper for evaluation datasets, metrics, and results.
How to Use
import torch

from motioncrafter import (
    MotionCrafterDiffPipeline,
    MotionCrafterDetermPipeline,
    UnifyAutoencoderKL,
    UNetSpatioTemporalConditionModelVid2vid,
)

unet_path = "TencentARC/MotionCrafter"
vae_path = "TencentARC/MotionCrafter"
model_type = "determ"  # or "diff" for the diffusion variant
cache_dir = "./pretrained_models"

# Load the UNet in fp16 for memory-efficient inference.
unet = UNetSpatioTemporalConditionModelVid2vid.from_pretrained(
    unet_path,
    subfolder='unet_diff' if model_type == 'diff' else 'unet_determ',
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    cache_dir=cache_dir,
).requires_grad_(False).to("cuda", dtype=torch.float16)

# Keep the 4D VAE in fp32 for numerically stable geometry/motion decoding.
geometry_motion_vae = UnifyAutoencoderKL.from_pretrained(
    vae_path,
    subfolder='geometry_motion_vae',
    low_cpu_mem_usage=True,
    torch_dtype=torch.float32,
    cache_dir=cache_dir,
).requires_grad_(False).to("cuda", dtype=torch.float32)

# Both pipelines build on the Stable Video Diffusion backbone with the custom UNet.
if model_type == 'diff':
    pipe = MotionCrafterDiffPipeline.from_pretrained(
        "stabilityai/stable-video-diffusion-img2vid-xt",
        unet=unet,
        torch_dtype=torch.float16,
        variant="fp16",
        cache_dir=cache_dir,
    ).to("cuda")
else:
    pipe = MotionCrafterDetermPipeline.from_pretrained(
        "stabilityai/stable-video-diffusion-img2vid-xt",
        unet=unet,
        torch_dtype=torch.float16,
        variant="fp16",
        cache_dir=cache_dir,
    ).to("cuda")
Model Weights
- geometry_motion_vae/: 4D VAE for joint geometry and motion representation
- unet_determ/: deterministic UNet for motion prediction
- unet_diff/: diffusion UNet for probabilistic motion prediction
Model Variants
- Deterministic (unet_determ): fast inference with fixed predictions per input
- Diffusion (unet_diff): probabilistic predictions with diverse outputs
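As a conceptual contrast (none of this is MotionCrafter's actual code), the two variants differ in that a deterministic predictor maps each input to one fixed output, while a diffusion predictor samples, so repeated runs can yield diverse plausible reconstructions:

```python
import random

def determ_predict(x):
    # Deterministic variant (toy stand-in): same input -> same output, always.
    return 2.0 * x

def diff_predict(x, rng):
    # Diffusion variant (toy stand-in): sampled noise -> diverse outputs.
    return 2.0 * x + rng.gauss(0.0, 0.1)

# The deterministic prediction is reproducible across calls.
assert determ_predict(1.0) == determ_predict(1.0)

# Repeated diffusion-style sampling produces multiple distinct predictions.
rng = random.Random(0)
samples = [diff_predict(1.0, rng) for _ in range(5)]
print(len(set(samples)))  # number of distinct sampled predictions
```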
Citation
@article{zhu2025motioncrafter,
  title={MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE},
  author={Zhu, Ruijie and Lu, Jiahao and Hu, Wenbo and Han, Xiaoguang and Cai, Jianfei and Shan, Ying and Zheng, Chuanxia},
  journal={arXiv preprint arXiv:2602.08961},
  year={2026}
}
License
This model is provided under the Tencent License. See LICENSE.txt for details.
Acknowledgments
This work builds upon GeometryCrafter. We thank the authors for their excellent contributions.