MVAR: Visual Autoregressive Modeling with Scale and Spatial Markovian Conditioning (ICLR 2026)
If this work is helpful to you, please consider starring this repo. Thanks!
Key Contributions
1. Efficiency Bottleneck: VAR exhibits scale and spatial redundancy, which leads to high GPU memory consumption.
2. Our Solution: The proposed Markovian conditioning enables MVAR to generate without relying on a KV cache during inference, significantly reducing the memory footprint.
Contents
- Citation
- News
- Pipeline
- Results
- Model Zoo
- Installation
- Training & Evaluation
Citation
Please cite our work if it is helpful for your research:
```bibtex
@article{zhang2025mvar,
  title={MVAR: Visual Autoregressive Modeling with Scale and Spatial Markovian Conditioning},
  author={Zhang, Jinhua and Long, Wei and Han, Minghao and You, Weiyi and Gu, Shuhang},
  journal={arXiv preprint arXiv:2505.12742},
  year={2025}
}
```
News
- 2026-02-05: Codebase and weights are now available.
- 2026-01-25: MVAR is accepted by ICLR 2026.
- 2025-05-20: Our MVAR paper is released on arXiv.
Pipeline
MVAR introduces the Scale and Spatial Markovian Assumption (see the sketch after this list):
- Scale Markovian: only the adjacent preceding scale is used as the condition for next-scale prediction.
- Spatial Markovian: the attention of each token is restricted to a localized neighborhood of size $k$ at the corresponding position on the adjacent preceding scale.
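The sketch below is a minimal, self-contained illustration of these two assumptions as a boolean cross-attention mask between adjacent scales. It is not the repository implementation (MVAR realizes the spatial constraint with neighborhood attention via NATTEN, installed below); the function name and arguments here are hypothetical.
```python
# Illustrative sketch only: a boolean cross-attention mask encoding the scale and
# spatial Markovian assumptions. Each token on the current scale may attend only to
# tokens on the adjacent preceding scale that lie inside a k x k neighborhood around
# its corresponding (rescaled) position. Names and shapes are ours, not the repo's.
import torch


def markovian_mask(cur_hw, prev_hw, k):
    """Return a (cur_h*cur_w, prev_h*prev_w) boolean mask; True = may attend."""
    cur_h, cur_w = cur_hw
    prev_h, prev_w = prev_hw
    half = k // 2

    # Corresponding position of every current-scale token on the preceding scale.
    ys = (torch.arange(cur_h) * prev_h // cur_h).clamp(0, prev_h - 1)
    xs = (torch.arange(cur_w) * prev_w // cur_w).clamp(0, prev_w - 1)

    py = torch.arange(prev_h)
    px = torch.arange(prev_w)
    # Allow attention only within a k x k window around the corresponding position.
    row_ok = (ys[:, None] - py[None, :]).abs() <= half            # (cur_h, prev_h)
    col_ok = (xs[:, None] - px[None, :]).abs() <= half            # (cur_w, prev_w)
    mask = row_ok[:, None, :, None] & col_ok[None, :, None, :]    # (cur_h, cur_w, prev_h, prev_w)
    return mask.reshape(cur_h * cur_w, prev_h * prev_w)


# Example: tokens of a 16x16 scale attend to 5x5 neighborhoods on the preceding 8x8 scale.
mask = markovian_mask((16, 16), (8, 8), k=5)
print(mask.shape)  # torch.Size([256, 64])
```
Such a mask could, for instance, be passed as `attn_mask` to `torch.nn.functional.scaled_dot_product_attention`; neighborhood attention achieves the same restriction without materializing the mask explicitly.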
Results
MVAR achieves a 3.0× reduction in GPU memory footprint compared to VAR.
- Comparison of quantitative results: MVAR vs. VAR
- ImageNet 256×256 benchmark
- Ablation study on the Markovian assumptions
MVAR Model Zoo
We provide various MVAR models, accessible via our Hugging Face repo; see the table below and the download sketch that follows it.
Model Performance & Weights
| Model | FID ↓ | IS ↑ | sFID ↓ | Prec. ↑ | Recall ↑ | Params | HF Weights |
|---|---|---|---|---|---|---|---|
| MVAR-d16 | 3.01 | 285.17 | 6.26 | 0.85 | 0.51 | 310M | link |
| MVAR-d16$^{*}$ | 3.37 | 295.35 | 6.10 | 0.86 | 0.48 | 310M | link |
| MVAR-d20$^{*}$ | 2.83 | 294.31 | 6.12 | 0.85 | 0.52 | 600M | link |
| MVAR-d24$^{*}$ | 2.15 | 298.85 | 5.62 | 0.84 | 0.56 | 1.0B | link |
Note: $^{*}$ indicates models fine-tuned from VAR weights on ImageNet.
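If you prefer the Python API over the CLI commands shown later, the checkpoints can also be fetched with `huggingface_hub.snapshot_download`. This is a minimal sketch, assuming the repo id and local directory used in the download commands in the Training & Evaluation section below:
```python
# Minimal sketch: download all MVAR checkpoints from the Hugging Face Hub.
# The repo id and target directory mirror the CLI commands later in this README.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="CVLUESTC/MVAR", local_dir="./checkpoints")
print("Checkpoints downloaded to:", local_dir)
```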
Installation
Create a conda environment:
```bash
conda create -n mvar python=3.11 -y
conda activate mvar
```
Install PyTorch and dependencies:
```bash
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 \
    xformers==0.0.32.post2 \
    --index-url https://download.pytorch.org/whl/cu128
pip install accelerate einops tqdm huggingface_hub pytz tensorboard \
    transformers typed-argument-parser thop matplotlib seaborn wheel \
    scipy packaging ninja openxlab lmdb pillow
```
Install Neighborhood Attention (NATTEN). You can use the pre-built .whl file provided in our HuggingFace repo:
```bash
pip install natten-0.21.1+torch280cu128-cp311-cp311-linux_x86_64.whl
```
Prepare the ImageNet dataset; the expected directory structure is:
```
/path/to/imagenet/
├── train/
│   ├── n01440764/
│   └── ...
└── val/
    ├── n01440764/
    └── ...
```
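As a quick sanity check of the layout above, the directories can be read with `torchvision.datasets.ImageFolder`. This is a sketch only; the path is a placeholder, and MVAR's own data pipeline applies its own transforms:
```python
# Sanity check that the ImageNet directory follows the expected class-folder layout.
# "/path/to/imagenet" is a placeholder path.
from torchvision import datasets

train_set = datasets.ImageFolder("/path/to/imagenet/train")
val_set = datasets.ImageFolder("/path/to/imagenet/val")
print(f"train classes: {len(train_set.classes)}, train images: {len(train_set)}")
print(f"val classes: {len(val_set.classes)}, val images: {len(val_set)}")
```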
Training & Evaluation
1. Requirements (Pre-trained VAR)
Before running MVAR, you must first download the required VAR weights.
You can use the Hugging Face CLI to download the entire model repository:
```bash
# Install huggingface_hub if you haven't
pip install huggingface_hub
# Download models to a local directory
hf download FoundationVision/var --local-dir ./pretrained/FoundationVision/var
```
2. Download MVAR
```bash
# Download models to a local directory
hf download CVLUESTC/MVAR --local-dir ./checkpoints
```
3. Flash-Attn and xFormers (Optional)
Install and compile flash-attn and xformers for faster attention computation; our code will automatically use them if they are installed (see models/basic_mvar.py#L17-L48). The sketch below illustrates this kind of fallback.
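The snippet only illustrates the general pattern (detect the optional packages and pick an attention backend); it is not a copy of `models/basic_mvar.py`, and the variable names are ours:
```python
# Illustrative fallback pattern for optional attention backends: prefer flash-attn,
# then xformers, otherwise fall back to PyTorch's built-in scaled dot-product attention.
# Flag names are hypothetical and do not come from the repository.
import importlib.util

HAS_FLASH_ATTN = importlib.util.find_spec("flash_attn") is not None
HAS_XFORMERS = importlib.util.find_spec("xformers") is not None

if HAS_FLASH_ATTN:
    backend = "flash-attn"
elif HAS_XFORMERS:
    backend = "xformers memory-efficient attention"
else:
    backend = "torch.nn.functional.scaled_dot_product_attention"
print("Using attention backend:", backend)
```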
4. Caching VQ-VAE Latents and Code Indices (Optional)
Given that our data augmentation consists only of simple center cropping and random flipping, the VQ-VAE latents and code indices can be pre-computed and saved to CACHED_PATH to reduce computational overhead during MVAR training:
```bash
# Cache the train split; use ${CACHED_PATH}/val_cache_mvar for the validation split
torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 main_cache.py \
  --img_size 256 --data_path ${IMAGENET_PATH} \
  --cached_path ${CACHED_PATH}/train_cache_mvar \
  --train  # specify the train split
```
5. Training Scripts
To train MVAR on ImageNet 256×256, you can pass --use_cached=True to use the pre-computed cached latents and code indices:
```bash
# Example for MVAR-d16
torchrun --nproc_per_node=8 --nnodes=... --node_rank=... --master_addr=... --master_port=... train.py \
  --depth=16 --bs=448 --ep=300 --fp16=1 --alng=1e-3 --wpe=0.1 \
  --data_path ${IMAGENET_PATH} --exp_name ${EXP_NAME}

# Example for MVAR-d16 (Fine-tuning)
torchrun --nproc_per_node=8 --nnodes=... --node_rank=... --master_addr=... --master_port=... train.py \
  --depth=16 --bs=448 --ep=80 --fp16=1 --alng=1e-3 --wpe=0.1 \
  --data_path ${IMAGENET_PATH} --exp_name ${EXP_NAME} --finetune_from_var=True

# Example for MVAR-d20 (Fine-tuning)
torchrun --nproc_per_node=8 --nnodes=... --node_rank=... --master_addr=... --master_port=... train.py \
  --depth=20 --bs=192 --ep=80 --fp16=1 --alng=1e-3 --wpe=0.1 \
  --data_path ${IMAGENET_PATH} --exp_name ${EXP_NAME} --finetune_from_var=True

# Example for MVAR-d24 (Fine-tuning)
torchrun --nproc_per_node=8 --nnodes=... --node_rank=... --master_addr=... --master_port=... train.py \
  --depth=24 --bs=448 --ep=80 --fp16=1 --alng=1e-3 --wpe=0.1 \
  --data_path ${IMAGENET_PATH} --exp_name ${EXP_NAME} --finetune_from_var=True
```
6. Sampling & FID Evaluation
6.1. Generate images:
```bash
python run_mvar_evaluate.py \
  --cfg 2.7 --top_p 0.99 --top_k 1200 --depth 16 \
  --mvar_ckpt ${MVAR_CKPT}
```
Suggested sampling settings for each model (collected in the sketch after this list):
- d16: cfg=2.7, top_p=0.99, top_k=1200
- d16$^{*}$: cfg=2.0, top_p=0.99, top_k=1200
- d20$^{*}$: cfg=1.5, top_p=0.96, top_k=900
- d24$^{*}$: cfg=1.4, top_p=0.96, top_k=900
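For convenience, these settings can be kept in one place and turned into the `run_mvar_evaluate.py` command from step 6.1. This small helper is only a sketch: the dictionary restates the values above, and the checkpoint path is a placeholder.
```python
# Suggested sampling hyperparameters from the list above, keyed by model
# (* = fine-tuned from VAR). The command template mirrors the flags in step 6.1.
SAMPLING_CFG = {
    "d16":  {"cfg": 2.7, "top_p": 0.99, "top_k": 1200, "depth": 16},
    "d16*": {"cfg": 2.0, "top_p": 0.99, "top_k": 1200, "depth": 16},
    "d20*": {"cfg": 1.5, "top_p": 0.96, "top_k": 900,  "depth": 20},
    "d24*": {"cfg": 1.4, "top_p": 0.96, "top_k": 900,  "depth": 24},
}


def sampling_command(model: str, mvar_ckpt: str) -> str:
    c = SAMPLING_CFG[model]
    return (f"python run_mvar_evaluate.py --cfg {c['cfg']} --top_p {c['top_p']} "
            f"--top_k {c['top_k']} --depth {c['depth']} --mvar_ckpt {mvar_ckpt}")


# Placeholder checkpoint path; substitute your actual ${MVAR_CKPT}.
print(sampling_command("d20*", "./checkpoints/your_mvar_d20_checkpoint.pth"))
```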
6.2. Run evaluation:
We use OpenAI's FID evaluation toolkit and the 256×256 reference ground-truth npz file to evaluate FID, IS, Precision, and Recall.
```bash
python utils/evaluations/c2i/evaluator.py \
  --ref_batch VIRTUAL_imagenet256_labeled.npz \
  --sample_batch ${SAMPLE_BATCH}
```
Related Repositories
Contact
If you have any questions, feel free to reach out at jinhua.zjh@gmail.com.