MVAR: Visual Autoregressive Modeling with Scale and Spatial Markovian Conditioning (ICLR 2026)
If this work is helpful to you, please consider starring this repo. Thanks!
Key Contributions
1. Efficiency Bottleneck: VAR exhibits scale and spatial redundancy, which leads to high GPU memory consumption.
2. Our Solution: The proposed Markovian conditioning enables MVAR to generate without relying on a KV cache during inference, significantly reducing the memory footprint.
Contents
- Citation
- News
- Pipeline
- Results
- Model Zoo
- Installation
- Training & Evaluation
Citation
Please cite our work if it is helpful for your research:
```bibtex
@article{zhang2025mvar,
  title={MVAR: Visual Autoregressive Modeling with Scale and Spatial Markovian Conditioning},
  author={Zhang, Jinhua and Long, Wei and Han, Minghao and You, Weiyi and Gu, Shuhang},
  journal={arXiv preprint arXiv:2505.12742},
  year={2025}
}
```
News
- 2026-02-05: Codebase and weights are now available.
- 2026-01-25: MVAR is accepted by ICLR 2026.
- 2025-05-20: Our MVAR paper is released on arXiv.
Pipeline
MVAR introduces the Scale and Spatial Markovian Assumption (see the sketch after this list):
- Scale Markovian: only the adjacent preceding scale is used as the condition for next-scale prediction.
- Spatial Markovian: the attention of each token is restricted to a localized neighborhood of size $k$ at the corresponding position on the adjacent preceding scale.
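The sketch below is a minimal, self-contained illustration of these two assumptions as a boolean cross-attention mask between adjacent scales. It is not the repository implementation (MVAR realizes the spatial constraint with neighborhood attention via NATTEN, installed below); the function name and arguments here are hypothetical.
```python
# Illustrative sketch only: a boolean cross-attention mask encoding the scale and
# spatial Markovian assumptions. Each token on the current scale may attend only to
# tokens on the adjacent preceding scale that lie inside a k x k neighborhood around
# its corresponding (rescaled) position. Names and shapes are ours, not the repo's.
import torch


def markovian_mask(cur_hw, prev_hw, k):
    """Return a (cur_h*cur_w, prev_h*prev_w) boolean mask; True = may attend."""
    cur_h, cur_w = cur_hw
    prev_h, prev_w = prev_hw
    half = k // 2

    # Corresponding position of every current-scale token on the preceding scale.
    ys = (torch.arange(cur_h) * prev_h // cur_h).clamp(0, prev_h - 1)
    xs = (torch.arange(cur_w) * prev_w // cur_w).clamp(0, prev_w - 1)

    py = torch.arange(prev_h)
    px = torch.arange(prev_w)
    # Allow attention only within a k x k window around the corresponding position.
    row_ok = (ys[:, None] - py[None, :]).abs() <= half            # (cur_h, prev_h)
    col_ok = (xs[:, None] - px[None, :]).abs() <= half            # (cur_w, prev_w)
    mask = row_ok[:, None, :, None] & col_ok[None, :, None, :]    # (cur_h, cur_w, prev_h, prev_w)
    return mask.reshape(cur_h * cur_w, prev_h * prev_w)


# Example: tokens of a 16x16 scale attend to 5x5 neighborhoods on the preceding 8x8 scale.
mask = markovian_mask((16, 16), (8, 8), k=5)
print(mask.shape)  # torch.Size([256, 64])
```
Such a mask could, for instance, be passed as `attn_mask` to `torch.nn.functional.scaled_dot_product_attention`; neighborhood attention achieves the same restriction without materializing the mask explicitly.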
Results
MVAR achieves a 3.0× reduction in GPU memory footprint compared to VAR.
- Comparison of quantitative results: MVAR vs. VAR
- ImageNet 256×256 benchmark
- Ablation study on the Markovian assumptions
MVAR Model Zoo
We provide various MVAR models, accessible via our Hugging Face repo; see the table below and the download sketch that follows it.
Model Performance & Weights
| Model | FID ↓ | IS ↑ | sFID ↓ | Prec. ↑ | Recall ↑ | Params | HF Weights |
|---|---|---|---|---|---|---|---|
| MVAR-d16 | 3.01 | 285.17 | 6.26 | 0.85 | 0.51 | 310M | link |
| MVAR-d16$^{*}$ | 3.37 | 295.35 | 6.10 | 0.86 | 0.48 | 310M | link |
| MVAR-d20$^{*}$ | 2.83 | 294.31 | 6.12 | 0.85 | 0.52 | 600M | link |
| MVAR-d24$^{*}$ | 2.15 | 298.85 | 5.62 | 0.84 | 0.56 | 1.0B | link |
Note: $^{*}$ indicates models fine-tuned from VAR weights on ImageNet.
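If you prefer the Python API over the CLI commands shown later, the checkpoints can also be fetched with `huggingface_hub.snapshot_download`. This is a minimal sketch, assuming the repo id and local directory used in the download commands in the Training & Evaluation section below:
```python
# Minimal sketch: download all MVAR checkpoints from the Hugging Face Hub.
# The repo id and target directory mirror the CLI commands later in this README.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="CVLUESTC/MVAR", local_dir="./checkpoints")
print("Checkpoints downloaded to:", local_dir)
```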
Installation
Create a conda environment:
```bash
conda create -n mvar python=3.11 -y
conda activate mvar
```
Install PyTorch and dependencies:
```bash
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 \
    xformers==0.0.32.post2 \
    --index-url https://download.pytorch.org/whl/cu128
pip install accelerate einops tqdm huggingface_hub pytz tensorboard \
    transformers typed-argument-parser thop matplotlib seaborn wheel \
    scipy packaging ninja openxlab lmdb pillow
```
Install Neighborhood Attention (NATTEN). You can use the pre-built .whl file provided in our HuggingFace repo:
```bash
pip install natten-0.21.1+torch280cu128-cp311-cp311-linux_x86_64.whl
```
Prepare the ImageNet dataset; the expected directory structure is:
```
/path/to/imagenet/
├── train/
│   ├── n01440764/
│   └── ...
└── val/
    ├── n01440764/
    └── ...
```
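As a quick sanity check of the layout above, the directories can be read with `torchvision.datasets.ImageFolder`. This is a sketch only; the path is a placeholder, and MVAR's own data pipeline applies its own transforms:
```python
# Sanity check that the ImageNet directory follows the expected class-folder layout.
# "/path/to/imagenet" is a placeholder path.
from torchvision import datasets

train_set = datasets.ImageFolder("/path/to/imagenet/train")
val_set = datasets.ImageFolder("/path/to/imagenet/val")
print(f"train classes: {len(train_set.classes)}, train images: {len(train_set)}")
print(f"val classes: {len(val_set.classes)}, val images: {len(val_set)}")
```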
Training & Evaluation
1. Requirements (Pre-trained VAR)
Before running MVAR, you must first download the required VAR weights.
You can use the Hugging Face CLI to download the entire model repository:
```bash
# Install huggingface_hub if you haven't
pip install huggingface_hub
# Download models to a local directory
hf download FoundationVision/var --local-dir ./pretrained/FoundationVision/var
```
2. Download MVAR
```bash
# Download models to a local directory
hf download CVLUESTC/MVAR --local-dir ./checkpoints
```
3. Flash-Attn and xFormers (Optional)
Install and compile flash-attn and xformers for faster attention computation; our code will automatically use them if they are installed (see models/basic_mvar.py#L17-L48). The sketch below illustrates this kind of fallback.
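The snippet only illustrates the general pattern (detect the optional packages and pick an attention backend); it is not a copy of `models/basic_mvar.py`, and the variable names are ours:
```python
# Illustrative fallback pattern for optional attention backends: prefer flash-attn,
# then xformers, otherwise fall back to PyTorch's built-in scaled dot-product attention.
# Flag names are hypothetical and do not come from the repository.
import importlib.util

HAS_FLASH_ATTN = importlib.util.find_spec("flash_attn") is not None
HAS_XFORMERS = importlib.util.find_spec("xformers") is not None

if HAS_FLASH_ATTN:
    backend = "flash-attn"
elif HAS_XFORMERS:
    backend = "xformers memory-efficient attention"
else:
    backend = "torch.nn.functional.scaled_dot_product_attention"
print("Using attention backend:", backend)
```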
4. Caching VQ-VAE Latents and Code Indices (Optional)
Given that our data augmentation consists only of simple center cropping and random flipping, the VQ-VAE latents and code indices can be pre-computed and saved to CACHED_PATH to reduce computational overhead during MVAR training:
```bash
# Cache the train split; use ${CACHED_PATH}/val_cache_mvar for the validation split
torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 main_cache.py \
  --img_size 256 --data_path ${IMAGENET_PATH} \
  --cached_path ${CACHED_PATH}/train_cache_mvar \
  --train  # specify the train split
```
5. Training Scripts
To train MVAR on ImageNet 256×256, you can pass --use_cached=True to use the pre-computed cached latents and code indices:
```bash
# Example for MVAR-d16
torchrun --nproc_per_node=8 --nnodes=... --node_rank=... --master_addr=... --master_port=... train.py \
  --depth=16 --bs=448 --ep=300 --fp16=1 --alng=1e-3 --wpe=0.1 \
  --data_path ${IMAGENET_PATH} --exp_name ${EXP_NAME}

# Example for MVAR-d16 (Fine-tuning)
torchrun --nproc_per_node=8 --nnodes=... --node_rank=... --master_addr=... --master_port=... train.py \
  --depth=16 --bs=448 --ep=80 --fp16=1 --alng=1e-3 --wpe=0.1 \
  --data_path ${IMAGENET_PATH} --exp_name ${EXP_NAME} --finetune_from_var=True

# Example for MVAR-d20 (Fine-tuning)
torchrun --nproc_per_node=8 --nnodes=... --node_rank=... --master_addr=... --master_port=... train.py \
  --depth=20 --bs=192 --ep=80 --fp16=1 --alng=1e-3 --wpe=0.1 \
  --data_path ${IMAGENET_PATH} --exp_name ${EXP_NAME} --finetune_from_var=True

# Example for MVAR-d24 (Fine-tuning)
torchrun --nproc_per_node=8 --nnodes=... --node_rank=... --master_addr=... --master_port=... train.py \
  --depth=24 --bs=448 --ep=80 --fp16=1 --alng=1e-3 --wpe=0.1 \
  --data_path ${IMAGENET_PATH} --exp_name ${EXP_NAME} --finetune_from_var=True
```
6. Sampling & FID Evaluation
6.1. Generate images:
```bash
python run_mvar_evaluate.py \
  --cfg 2.7 --top_p 0.99 --top_k 1200 --depth 16 \
  --mvar_ckpt ${MVAR_CKPT}
```
Suggested sampling settings for each model (collected in the sketch after this list):
- d16: cfg=2.7, top_p=0.99, top_k=1200
- d16$^{*}$: cfg=2.0, top_p=0.99, top_k=1200
- d20$^{*}$: cfg=1.5, top_p=0.96, top_k=900
- d24$^{*}$: cfg=1.4, top_p=0.96, top_k=900
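For convenience, these settings can be kept in one place and turned into the `run_mvar_evaluate.py` command from step 6.1. This small helper is only a sketch: the dictionary restates the values above, and the checkpoint path is a placeholder.
```python
# Suggested sampling hyperparameters from the list above, keyed by model
# (* = fine-tuned from VAR). The command template mirrors the flags in step 6.1.
SAMPLING_CFG = {
    "d16":  {"cfg": 2.7, "top_p": 0.99, "top_k": 1200, "depth": 16},
    "d16*": {"cfg": 2.0, "top_p": 0.99, "top_k": 1200, "depth": 16},
    "d20*": {"cfg": 1.5, "top_p": 0.96, "top_k": 900,  "depth": 20},
    "d24*": {"cfg": 1.4, "top_p": 0.96, "top_k": 900,  "depth": 24},
}


def sampling_command(model: str, mvar_ckpt: str) -> str:
    c = SAMPLING_CFG[model]
    return (f"python run_mvar_evaluate.py --cfg {c['cfg']} --top_p {c['top_p']} "
            f"--top_k {c['top_k']} --depth {c['depth']} --mvar_ckpt {mvar_ckpt}")


# Placeholder checkpoint path; substitute your actual ${MVAR_CKPT}.
print(sampling_command("d20*", "./checkpoints/your_mvar_d20_checkpoint.pth"))
```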
6.2. Run evaluation:
We use OpenAI's FID evaluation toolkit and the 256×256 reference ground-truth npz file to evaluate FID, IS, Precision, and Recall.
```bash
python utils/evaluations/c2i/evaluator.py \
  --ref_batch VIRTUAL_imagenet256_labeled.npz \
  --sample_batch ${SAMPLE_BATCH}
```
Related Repositories
Contact
If you have any questions, feel free to reach out at jinhua.zjh@gmail.com.