UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors

UniVidX is a unified multimodal video diffusion framework for versatile video generation and perception. Rather than fixing a single task, it trains one model to handle different input-output mappings, supporting omni-directional conditional generation across multiple modalities.

This repository hosts the released UniVidX checkpoints:

  • univid_intrinsic.safetensors: checkpoint for UniVid-Intrinsic, covering RGB, albedo, irradiance, and normal video modalities.
  • univid_alpha.safetensors: checkpoint for UniVid-Alpha, covering blended RGB video, alpha matte, foreground, and background modalities.

Citation

If you find this work useful, please cite:

@article{chen2026unividx,
  title     = {UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors},
  author    = {Chen, Houyuan and Li, Hong and Kong, Xianghao and Zhu, Tianrui and Xu, Shaocong and Xiao, Weiqing and Guo, Yuwei and Ye, Chongjie and Zhang, Lvmin and Zhao, Hao and Rao, Anyi},
  journal   = {ACM Transactions on Graphics},
  volume    = {45},
  number    = {4},
  articleno = {51},
  year      = {2026},
  month     = jul,
  doi       = {10.1145/3811304},
  url       = {https://doi.org/10.1145/3811304}
}