Clean-Context Autoregressive Video Diffusion: Checkpoints
Trained denoisers from "What Matters in Clean-Context Autoregressive Video Diffusion". Each checkpoint is a Diffusion-Forcing video denoiser trained on 32-frame windows of 64×64 RGB frames; see the paper for the full setup and the code repository for training and evaluation.
Main checkpoints (16)
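Four training configurations are crossed with two denoiser backbones and two datasets (4 × 2 × 2 = 16 checkpoints). The configurations are DF (Diffusion Forcing baseline), Mask-only (masked prefix loss), Clean-only (clean prefix), and Full (clean prefix + masked loss). Naming: {backbone}_{dataset}_{config}.ckpt, with every field lowercased and the config written with underscores (for example, mask_only).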
| File | Backbone | Dataset | Config |
|---|---|---|---|
| unet_dmlab_df.ckpt | U-Net | DMLab | DF |
| unet_dmlab_mask_only.ckpt | U-Net | DMLab | Mask-only |
| unet_dmlab_clean_only.ckpt | U-Net | DMLab | Clean-only |
| unet_dmlab_full.ckpt | U-Net | DMLab | Full |
| unet_minecraft_df.ckpt | U-Net | Minecraft | DF |
| unet_minecraft_mask_only.ckpt | U-Net | Minecraft | Mask-only |
| unet_minecraft_clean_only.ckpt | U-Net | Minecraft | Clean-only |
| unet_minecraft_full.ckpt | U-Net | Minecraft | Full |
| dit_dmlab_df.ckpt | DiT | DMLab | DF |
| dit_dmlab_mask_only.ckpt | DiT | DMLab | Mask-only |
| dit_dmlab_clean_only.ckpt | DiT | DMLab | Clean-only |
| dit_dmlab_full.ckpt | DiT | DMLab | Full |
| dit_minecraft_df.ckpt | DiT | Minecraft | DF |
| dit_minecraft_mask_only.ckpt | DiT | Minecraft | Mask-only |
| dit_minecraft_clean_only.ckpt | DiT | Minecraft | Clean-only |
| dit_minecraft_full.ckpt | DiT | Minecraft | Full |
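The naming scheme can be enumerated programmatically, which helps when looping over all main checkpoints in an evaluation script. The following is a minimal sketch; the constant and function names (`BACKBONES`, `checkpoint_names`, `parse_name`) are illustrative and do not come from the code repository.

```python
from itertools import product

# Field values as they appear in the filenames listed in the table.
BACKBONES = ("unet", "dit")
DATASETS = ("dmlab", "minecraft")
CONFIGS = ("df", "mask_only", "clean_only", "full")


def checkpoint_names():
    """Return the 16 main checkpoint filenames, in the same order as the table."""
    return [f"{b}_{d}_{c}.ckpt" for b, d, c in product(BACKBONES, DATASETS, CONFIGS)]


def parse_name(filename):
    """Split a main-checkpoint filename into (backbone, dataset, config)."""
    stem = filename.removesuffix(".ckpt")  # requires Python 3.9+
    # The config may itself contain an underscore (mask_only, clean_only),
    # so split on the first two underscores only.
    backbone, dataset, config = stem.split("_", 2)
    return backbone, dataset, config
```

This helper covers only the main checkpoints; the files under ablation/ carry an extra GroupNorm field and would need their own pattern.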
All sixteen share the same optimiser, schedule, and diffusion settings (AdamW, lr 8×10⁻⁵, 100k steps, batch 8, fp16, cosine schedule with K=1000, v-prediction; DDIM with 100 steps, η=0 at inference); only the backbone, the dataset, and the training configuration differ. The U-Net is the 3D-convolutional Diffusion-Forcing backbone (≈18.65 M params); the DiT is a strictly frame-causal diffusion transformer (≈18.84 M params).
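To make the diffusion settings concrete, the sketch below works through a cosine noise schedule with K=1000 levels, the v-prediction target, and one deterministic (η=0) DDIM update on scalar values. It is a textbook-style illustration under stated assumptions, not the repository's implementation: the offset `S = 0.008`, the evenly spaced 100-step grid, and all function names are assumptions, and the actual code may clip or parameterise the schedule differently.

```python
import math

K = 1000          # number of noise levels (from the settings above)
DDIM_STEPS = 100  # sampling steps (from the settings above)
S = 0.008         # ASSUMPTION: small offset used by the standard cosine schedule


def alpha_bar(k):
    """Cumulative signal level at noise level k in [0, K] under a cosine schedule."""
    def f(t):
        return math.cos((t / K + S) / (1 + S) * math.pi / 2) ** 2
    return f(k) / f(0)


def coeffs(k):
    """Return (signal scale, noise scale) at level k."""
    ab = alpha_bar(k)
    return math.sqrt(ab), math.sqrt(1.0 - ab)


def add_noise(x0, eps, k):
    """Forward process: mix clean value x0 with Gaussian noise eps."""
    a, s = coeffs(k)
    return a * x0 + s * eps


def v_target(x0, eps, k):
    """v-prediction regression target: v = a * eps - s * x0."""
    a, s = coeffs(k)
    return a * eps - s * x0


def ddim_step(x_k, v, k, k_prev):
    """One deterministic (eta = 0) DDIM update from level k to level k_prev."""
    a, s = coeffs(k)
    x0_hat = a * x_k - s * v    # clean estimate recovered from v
    eps_hat = s * x_k + a * v   # noise estimate recovered from v
    a_prev, s_prev = coeffs(k_prev)
    return a_prev * x0_hat + s_prev * eps_hat


# ASSUMPTION: 100 evenly spaced levels, from noisiest to cleanest.
STEP = K // DDIM_STEPS
timesteps = list(range(K - 1, -1, -STEP))
```

With a perfect v prediction, a single `ddim_step` from any level k to level 0 returns the clean value exactly, which is a convenient sanity check when wiring a sampler to one of these checkpoints.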
Ablation checkpoints (ablation/)
The causal-GroupNorm ablation (paper §4 / Table 4) trains the U-Net on DMLab to a matched 30k-step budget, comparing the standard (leaky) temporal GroupNorm with a frame-causal variant, under both DF and Full.
| File | GroupNorm | Config |
|---|---|---|
| ablation/unet_dmlab_leaky_df.ckpt | leaky (standard) | DF |
| ablation/unet_dmlab_leaky_full.ckpt | leaky (standard) | Full |
| ablation/unet_dmlab_framecausal_df.ckpt | frame-causal | DF |
| ablation/unet_dmlab_framecausal_full.ckpt | frame-causal | Full |
Loading
These are PyTorch-Lightning checkpoints. Load them with the matching config from the code repository; the U-Net and DiT backbones and all training settings are specified there.
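Before building the model, it can be useful to inspect a checkpoint file directly. The sketch below assumes the usual PyTorch-Lightning layout, a pickled dict with a `state_dict` entry plus bookkeeping such as `global_step`; the helper names are hypothetical, and it does not reconstruct the model, which still requires the matching config and model class from the code repository.

```python
def load_checkpoint(path):
    """Load a Lightning checkpoint file as a plain dict on CPU."""
    import torch  # imported lazily so summarize() can be used without torch

    # weights_only=False unpickles arbitrary Python objects,
    # so use it only with files from a trusted source.
    return torch.load(path, map_location="cpu", weights_only=False)


def summarize(ckpt):
    """Summarize a checkpoint dict without instantiating the model."""
    state = ckpt["state_dict"]
    return {
        "global_step": ckpt.get("global_step"),
        "num_tensors": len(state),
        # Counts every stored tensor element, buffers included.
        "num_elements": sum(t.numel() for t in state.values()),
        "top_level_modules": sorted({name.split(".")[0] for name in state}),
    }
```

A typical call would be `summarize(load_checkpoint("unet_dmlab_full.ckpt"))`. Because `num_elements` also counts non-trainable buffers, and any auxiliary weights stored alongside the model, it may not match the quoted parameter counts exactly.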