---
library_name: transformers
base_model: nvidia/Nemotron-Cascade-2-30B-A3B
base_model_relation: quantized
tags:
- fp4
- nvfp4
- quantized
- moe
- mamba
- hybrid
- nemotron
- nvidia
- modelopt
license: other
license_name: nvidia-open-model-license
license_link: >-
  https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
datasets:
- nvidia/Nemotron-Cascade-2-SFT-Data
---

# Nemotron-Cascade-2-30B-A3B-NVFP4

NVFP4 (4-bit) quantization of [Nemotron-Cascade-2-30B-A3B](https://huggingface.co/nvidia/Nemotron-Cascade-2-30B-A3B). Quantized with [NVIDIA ModelOpt](https://github.com/NVIDIA/Model-Optimizer) using the same selective recipe as [NVIDIA's official Nano NVFP4](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4): MoE experts and Mamba GEMMs in NVFP4 (E2M1 with block scaling), attention and other sensitive layers in BF16, KV cache in FP8. Native FP4 compute on Blackwell; weight-only dequantization on Hopper.

## Benchmarks

Calculated using [NVIDIA-NeMo/Evaluator](https://github.com/NVIDIA-NeMo/Evaluator) with the config from [Nemotron-3-Super-120B's eval config](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/packages/nemo-evaluator-launcher/examples/nemotron/nemotron-3-super/local_nemotron-3-super-120b-a12b_tools.yaml):

| Benchmark | [Nemotron-Cascade-2-30B-A3B](https://huggingface.co/nvidia/Nemotron-Cascade-2-30B-A3B)<br>(reproduced results) | **[Nemotron-Cascade-2-30B-A3B-NVFP4](https://huggingface.co/chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4)**<br>(this model) |
|---|---|---|
| AIME 2025 (avg@8) | 98.8 | **97.9** |
| AIME 2026 (avg@8) | 94.2 | **92.1** |
| HMMT Feb 2025 (avg@8) | 92.9 | **90.1** |

*With the low sample count (8 rollouts per problem), a deviation of ±2% across runs is expected. The NVFP4 model is consistently 1-2% below the original BF16.*

## Quantization Details

- **Method:** NVFP4 post-training quantization (PTQ), without quantization-aware distillation (QAD)
- **Format:** E2M1 (1 sign, 2 exponent, 1 mantissa bit) with hierarchical block scaling
- **Block scaling:** Group size 16 — each block of 16 values shares an FP8 E4M3 scale, plus a per-tensor FP32 global scale
- **KV cache:** FP8
- **Tooling:** [NVIDIA ModelOpt](https://github.com/NVIDIA/Model-Optimizer)

### Selective Quantization Recipe

Follows the Nano-architecture selective quantization recipe from the [Nemotron 3 Nano Technical Report](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf) (Section 4). Same recipe as [NVIDIA's official NVFP4 checkpoint](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4).
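As a concrete illustration of the E2M1 + block-scaling scheme described above, here is a minimal fake-quantization sketch in NumPy. It rounds each value onto the E2M1 magnitude grid with a per-block scale; for clarity the scale is kept in full precision, whereas real NVFP4 stores it in FP8 E4M3 plus an FP32 per-tensor scale. This snippet is illustrative only and is not part of ModelOpt:

```python
import numpy as np

# Non-negative magnitudes representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit)
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_nvfp4_block(block: np.ndarray) -> np.ndarray:
    """Fake-quantize one block of 16 values onto the E2M1 grid and back."""
    assert block.size == 16, "NVFP4 uses blocks of 16 values per scale"
    amax = np.abs(block).max()
    if amax == 0.0:
        return np.zeros_like(block)
    # Per-block scale chosen so the largest magnitude maps to 6.0, the E2M1 max.
    # (Simplification: real NVFP4 quantizes this scale to FP8 E4M3.)
    scale = amax / 6.0
    scaled = block / scale
    # Round each magnitude to the nearest representable E2M1 value.
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    return np.sign(scaled) * E2M1_GRID[idx] * scale

x = np.linspace(-1.0, 1.0, 16)
xq = quantize_nvfp4_block(x)  # low-precision approximation of x
```

Note that only 8 magnitudes (16 signed values) exist per block, which is why the per-block FP8 scale is essential: it re-centers the tiny grid on each group of 16 weights.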
Sensitive components are kept in higher precision:

| Component | Precision | Rationale |
|---|---|---|
| MoE expert GEMMs (routed + shared) | **NVFP4** | All 23 MoE layers, 128 routed + 2 shared experts each |
| Mamba GEMMs (non-adjacent) | **NVFP4** | 17 of 23 Mamba layers |
| Attention layers (all 6) | **BF16** | Most sensitive — kept BF16 per NVIDIA sensitivity analysis |
| Mamba layers adjacent to attention (6) | **BF16** | Layers {4, 11, 18, 25, 32, 41} — found sensitive in ablations |
| Mamba 1D conv | **BF16** | All layers |
| Router gates | **FP32** | Routing precision must not degrade |
| Embeddings & lm_head | **BF16** | Not quantized |
| KV cache | **FP8** | All 6 attention layers |

### Calibration

- **Dataset:** 4,000 samples from [nvidia/Nemotron-Cascade-2-SFT-Data](https://huggingface.co/datasets/nvidia/Nemotron-Cascade-2-SFT-Data)
- **Domain mix:** math (1000), swe (900), terminal_agent (500), science (500), chat (400), conversational_agent (300), instruction_following (300), safety (100)
- **Sequence length:** Up to 12,288 tokens (no padding, natural length per sample)

## Usage

### SGLang

```bash
python -m sglang.launch_server \
  --model chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 \
  --trust-remote-code \
  --tool-call-parser qwen3_coder \
  --reasoning-parser nano_v3
```

### vLLM

```bash
vllm serve chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 \
  --mamba_ssm_cache_dtype float32 \
  --max-model-len 262144 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser nemotron_v3 \
  --kv-cache-dtype fp8
```

## GPU Requirements

| Architecture | GPU Examples | FP4 Support |
|---|---|---|
| Blackwell (SM100+) | B200, RTX 5090 | Native W4A4 — full compute speedup |
| Hopper (SM90) | H100, H200 | Weight-only dequantization at runtime |
| Ampere (SM80/86) | RTX 3090, A100 | Not supported |

Native FP4 Tensor Core compute requires Blackwell GPUs.
On older supported architectures, weights are stored in FP4 but dequantized to FP16/BF16 at runtime — you still get the VRAM savings but not the compute speedup.

## Acknowledgments

- Quantization recipe based on the [Nemotron 3 Nano Technical Report](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf)
- Quantized with [NVIDIA ModelOpt](https://github.com/NVIDIA/Model-Optimizer)
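To check at runtime which tier of the GPU requirements table a device falls into, a small helper along these lines can be used. This is a hypothetical convenience function, not part of this repo or any serving framework; the tier boundaries simply follow the table above, and on a CUDA machine the capability tuple can be obtained from `torch.cuda.get_device_capability()`:

```python
def fp4_support_tier(compute_capability: tuple[int, int]) -> str:
    """Map a CUDA compute capability (major, minor) to an FP4 support tier.

    Illustrative helper; tier boundaries follow the GPU Requirements table:
    Blackwell (SM100+) native, Hopper (SM90) weight-only, older unsupported.
    """
    major, minor = compute_capability
    sm = major * 10 + minor
    if sm >= 100:
        return "native W4A4"          # Blackwell: full FP4 compute speedup
    if sm >= 90:
        return "weight-only dequant"  # Hopper: FP4 storage, BF16 compute
    return "unsupported"

# On a machine with torch + CUDA, for example:
#   import torch
#   tier = fp4_support_tier(torch.cuda.get_device_capability())
```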