Nemotron-Cascade-2-30B-A3B, Q5_K_M GGUF

GGUF quantization of nvidia/Nemotron-Cascade-2-30B-A3B.

  • Architecture: Hybrid Attention + Mamba (SSM) + MoE (GGUF arch: nemotron_h_moe); 30B total parameters, ~3B active per token
  • Quantization: Q5_K_M (k-quant, mixed precision, ~5.5 bpw)
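As a rough sanity check, the on-disk size of a k-quant file can be estimated from the parameter count and bits per weight. This is a sketch only; the exact figure depends on the per-tensor quant mix and metadata overhead:

```python
# Rough GGUF file-size estimate: params * bits-per-weight / 8 bytes.
# Q5_K_M mixes several quant types, so ~5.5 bpw is an approximation.
params = 30e9          # total parameters (MoE counts all experts, not just active)
bpw = 5.5              # approximate average bits per weight for Q5_K_M
size_gib = params * bpw / 8 / 2**30
print(f"~{size_gib:.1f} GiB")  # prints ~19.2 GiB
```

Note that although only ~3B parameters are active per token, all 30B must be stored (and, for fast inference, fit in memory), so the MoE design lowers compute cost but not file size.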

Quantization commands

# Convert the HF model to GGUF (bf16). Note: convert_hf_to_gguf.py expects a
# local copy of the model, so download nvidia/Nemotron-Cascade-2-30B-A3B first
python llama.cpp/convert_hf_to_gguf.py \
  nvidia/Nemotron-Cascade-2-30B-A3B \
  --outfile nemotron-cascade-30b-bf16.gguf \
  --outtype bf16

# Quantize to Q5_K_M
llama-quantize nemotron-cascade-30b-bf16.gguf \
  nemotron-cascade-30b-Q5_K_M.gguf Q5_K_M

Usage

Load in LM Studio, llama.cpp, or any GGUF-compatible runtime.
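For example, with llama.cpp's command-line tool (a sketch; flag values such as context length and GPU layer count should be adjusted for your hardware):

```shell
# Run a quick prompt with llama.cpp (assumes llama-cli is on PATH
# and the quantized file is in the current directory)
llama-cli -m nemotron-cascade-30b-Q5_K_M.gguf \
  -p "Explain mixture-of-experts in one paragraph." \
  -n 256 -ngl 99
```

`-ngl 99` offloads all layers to the GPU when VRAM allows; drop or lower it for CPU-only inference.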
