---
library_name: transformers
base_model: nvidia/Nemotron-Cascade-2-30B-A3B
base_model_relation: quantized
tags:
- fp4
- nvfp4
- quantized
- moe
- mamba
- hybrid
- nemotron
- nvidia
- modelopt
license: other
license_name: nvidia-open-model-license
license_link: >-
  https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
datasets:
- nvidia/Nemotron-Cascade-2-SFT-Data
---

# Nemotron-Cascade-2-30B-A3B-NVFP4

NVFP4 (4-bit) quantization of [Nemotron-Cascade-2-30B-A3B](https://huggingface.co/nvidia/Nemotron-Cascade-2-30B-A3B). Quantized with [NVIDIA ModelOpt](https://github.com/NVIDIA/Model-Optimizer) using the same selective recipe as [NVIDIA's official Nano NVFP4](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4): MoE experts and Mamba GEMMs in NVFP4 (E2M1 with block scaling), attention and other sensitive layers in BF16, KV cache in FP8. Native FP4 compute on Blackwell; weight-only dequantization on Hopper.

## Benchmarks

Calculated using [NVIDIA-NeMo/Evaluator](https://github.com/NVIDIA-NeMo/Evaluator) with the config from [Nemotron-3-Super-120B's eval config](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/packages/nemo-evaluator-launcher/examples/nemotron/nemotron-3-super/local_nemotron-3-super-120b-a12b_tools.yaml):

| Benchmark | [Nemotron-Cascade-2-30B-A3B](https://huggingface.co/nvidia/Nemotron-Cascade-2-30B-A3B)<br>(reproduced results) | **[Nemotron-Cascade-2-30B-A3B-NVFP4](https://huggingface.co/chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4)**<br>(this model) |
|---|---|---|
| AIME 2025 (avg@8) | 98.8 | **97.9** |
| AIME 2026 (avg@8) | 94.2 | **92.1** |
| HMMT Feb 2025 (avg@8) | 92.9 | **90.1** |

*With the low sample count (8 rollouts per problem), a deviation of ±2% across runs is expected. The NVFP4 model is consistently 1-2% below the original BF16.*

## Quantization Details

- **Method:** NVFP4 post-training quantization (PTQ), without quantization-aware distillation (QAD)
- **Format:** E2M1 (1 sign, 2 exponent, 1 mantissa bit) with hierarchical block scaling
- **Block scaling:** Group size 16 — each block of 16 values shares an FP8 E4M3 scale, plus a per-tensor FP32 global scale
- **KV cache:** FP8
- **Tooling:** [NVIDIA ModelOpt](https://github.com/NVIDIA/Model-Optimizer)

### Selective Quantization Recipe

Follows the Nano-architecture selective quantization recipe from the [Nemotron 3 Nano Technical Report](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf) (Section 4). Same recipe as [NVIDIA's official NVFP4 checkpoint](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4).
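As a concrete illustration of the E2M1 + block-scaling scheme described above, here is a minimal fake-quantization sketch in NumPy. It rounds each value onto the E2M1 magnitude grid with a per-block scale; for clarity the scale is kept in full precision, whereas real NVFP4 stores it in FP8 E4M3 plus an FP32 per-tensor scale. This snippet is illustrative only and is not part of ModelOpt:

```python
import numpy as np

# Non-negative magnitudes representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit)
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_nvfp4_block(block: np.ndarray) -> np.ndarray:
    """Fake-quantize one block of 16 values onto the E2M1 grid and back."""
    assert block.size == 16, "NVFP4 uses blocks of 16 values per scale"
    amax = np.abs(block).max()
    if amax == 0.0:
        return np.zeros_like(block)
    # Per-block scale chosen so the largest magnitude maps to 6.0, the E2M1 max.
    # (Simplification: real NVFP4 quantizes this scale to FP8 E4M3.)
    scale = amax / 6.0
    scaled = block / scale
    # Round each magnitude to the nearest representable E2M1 value.
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    return np.sign(scaled) * E2M1_GRID[idx] * scale

x = np.linspace(-1.0, 1.0, 16)
xq = quantize_nvfp4_block(x)  # low-precision approximation of x
```

Note that only 8 magnitudes (16 signed values) exist per block, which is why the per-block FP8 scale is essential: it re-centers the tiny grid on each group of 16 weights.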
Sensitive components are kept in higher precision:

| Component | Precision | Rationale |
|---|---|---|
| MoE expert GEMMs (routed + shared) | **NVFP4** | All 23 MoE layers, 128 routed + 2 shared experts each |
| Mamba GEMMs (non-adjacent) | **NVFP4** | 17 of 23 Mamba layers |
| Attention layers (all 6) | **BF16** | Most sensitive — kept BF16 per NVIDIA sensitivity analysis |
| Mamba layers adjacent to attention (6) | **BF16** | Layers {4, 11, 18, 25, 32, 41} — found sensitive in ablations |
| Mamba 1D conv | **BF16** | All layers |
| Router gates | **FP32** | Routing precision must not degrade |
| Embeddings & lm_head | **BF16** | Not quantized |
| KV cache | **FP8** | All 6 attention layers |

### Calibration

- **Dataset:** 4,000 samples from [nvidia/Nemotron-Cascade-2-SFT-Data](https://huggingface.co/datasets/nvidia/Nemotron-Cascade-2-SFT-Data)
- **Domain mix:** math (1000), swe (900), terminal_agent (500), science (500), chat (400), conversational_agent (300), instruction_following (300), safety (100)
- **Sequence length:** Up to 12,288 tokens (no padding, natural length per sample)

## Usage

### SGLang

```bash
python -m sglang.launch_server \
  --model chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 \
  --trust-remote-code \
  --tool-call-parser qwen3_coder \
  --reasoning-parser nano_v3
```

### vLLM

```bash
vllm serve chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 \
  --mamba_ssm_cache_dtype float32 \
  --max-model-len 262144 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser nemotron_v3 \
  --kv-cache-dtype fp8
```

## GPU Requirements

| Architecture | GPU Examples | FP4 Support |
|---|---|---|
| Blackwell (SM100+) | B200, RTX 5090 | Native W4A4 — full compute speedup |
| Hopper (SM90) | H100, H200 | Weight-only dequantization at runtime |
| Ampere (SM80/86) | RTX 3090, A100 | Not supported |

Native FP4 Tensor Core compute requires Blackwell GPUs.
On older supported architectures, weights are stored in FP4 but dequantized to FP16/BF16 at runtime — you still get the VRAM savings but not the compute speedup.

## Acknowledgments

- Quantization recipe based on the [Nemotron 3 Nano Technical Report](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf)
- Quantized with [NVIDIA ModelOpt](https://github.com/NVIDIA/Model-Optimizer)
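To check at runtime which tier of the GPU requirements table a device falls into, a small helper along these lines can be used. This is a hypothetical convenience function, not part of this repo or any serving framework; the tier boundaries simply follow the table above, and on a CUDA machine the capability tuple can be obtained from `torch.cuda.get_device_capability()`:

```python
def fp4_support_tier(compute_capability: tuple[int, int]) -> str:
    """Map a CUDA compute capability (major, minor) to an FP4 support tier.

    Illustrative helper; tier boundaries follow the GPU Requirements table:
    Blackwell (SM100+) native, Hopper (SM90) weight-only, older unsupported.
    """
    major, minor = compute_capability
    sm = major * 10 + minor
    if sm >= 100:
        return "native W4A4"          # Blackwell: full FP4 compute speedup
    if sm >= 90:
        return "weight-only dequant"  # Hopper: FP4 storage, BF16 compute
    return "unsupported"

# On a machine with torch + CUDA, for example:
#   import torch
#   tier = fp4_support_tier(torch.cuda.get_device_capability())
```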