Credit: This is a GGUF quantization of 0xSero/NVIDIA-Nemotron-3-Super-120B-A12B-BF16-REAP-50pct-draft, a REAP expert-pruned checkpoint derived from nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16. All credit for the base model goes to NVIDIA, and the REAP pruning work to 0xSero.
# NVIDIA Nemotron-H 120B REAP 50% — GGUF
GGUF quantizations of the REAP 50%-pruned Nemotron-H 120B model for use with llama.cpp and compatible tools.
## Available Quantizations
| File | Quant | Size | BPW |
|---|---|---|---|
| Nemotron-H-120B-REAP-50pct-BF16.gguf | BF16 | 128.6 GB | 16.01 |
| Nemotron-H-120B-REAP-50pct-Q8_0.gguf | Q8_0 | 68.4 GB | 8.52 |
| Nemotron-H-120B-REAP-50pct-Q6_K.gguf | Q6_K | 59.7 GB | 7.43 |
| Nemotron-H-120B-REAP-50pct-Q4_K_M.gguf | Q4_K_M | 45.4 GB | 5.65 |
## Model Details
| Property | Value |
|---|---|
| Architecture | NemotronH (hybrid Mamba + MoE + Attention) |
| Total Blocks | 88 (40 Mamba, 40 MoE, 8 Attention) |
| Original Parameters | ~120B (64B after 50% expert pruning) |
| Experts per MoE Layer | 256 (pruned from 512) |
| Routed Experts per Token | 22 |
| Context Length | 262,144 tokens |
| Vocab Size | 131,072 |
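As a rough sanity check, each file size above is consistent with the ~64B post-pruning parameter count times the listed bits per weight (BPW). A minimal sketch of that arithmetic (the small remaining gap is GGUF metadata and per-block quantization overhead):

```python
# Rough size check: file size ≈ parameters × bits-per-weight / 8.
# Assumes ~64B effective parameters after 50% expert pruning (from the table above).
params = 64e9

for quant, bpw, listed_gb in [
    ("BF16", 16.01, 128.6),
    ("Q8_0", 8.52, 68.4),
    ("Q6_K", 7.43, 59.7),
    ("Q4_K_M", 5.65, 45.4),
]:
    est_gb = params * bpw / 8 / 1e9
    print(f"{quant}: ~{est_gb:.1f} GB estimated (listed {listed_gb} GB)")
```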
## Usage
```bash
# With llama.cpp
llama-cli -m Nemotron-H-120B-REAP-50pct-Q4_K_M.gguf -p "Hello" -n 128

# With ollama (create a Modelfile first)
ollama create nemotron-h-reap -f Modelfile
```
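For the ollama step, a minimal Modelfile sketch (assumes the Q4_K_M GGUF sits in the current directory; the context-size parameter is an illustrative choice, not a recommendation):

```
FROM ./Nemotron-H-120B-REAP-50pct-Q4_K_M.gguf
PARAMETER num_ctx 8192
```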
## About REAP Pruning
This model was pruned using the REAP method (arXiv:2510.13999), which removes 50% of the MoE experts in each layer based on observed per-expert activation statistics. This cuts the memory footprint roughly in half while preserving the most commonly activated expert pathways.
- Source: 0xSero/reap-expert-swap
- Research funding: donate.sybilsolutions.ai
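The core idea can be illustrated with a toy sketch: rank experts by an activation-based score collected on calibration data and keep the top half. This is not the actual REAP implementation (see arXiv:2510.13999 and the linked repository for the real saliency criterion); sizes and scores here are made up for illustration.

```python
import numpy as np

# Toy expert pruning by observed activation frequency (illustrative only;
# REAP uses a more refined saliency measure). The real model has 512 experts
# per MoE layer; we use 8 here to keep the example readable.
rng = np.random.default_rng(0)
num_experts = 8
activation_counts = rng.integers(1, 1000, size=num_experts)  # stand-in for calibration stats

# Keep the 50% most frequently routed experts, drop the rest.
keep = np.argsort(activation_counts)[-num_experts // 2:]
kept_mask = np.zeros(num_experts, dtype=bool)
kept_mask[keep] = True

print("activation counts:", activation_counts.tolist())
print("kept experts:", sorted(keep.tolist()))
```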
## Draft Caveats
This is a draft checkpoint derived from the original author's pruning work. Full serving benchmarks and quality evaluations have not been completed; evaluate accordingly before production use.
## License
Distributed under the NVIDIA Open Model License. See the original model for full terms.
## Quantized by

DJLougen, using llama.cpp on a DGX Spark.