Credit: This is a GGUF quantization of 0xSero/NVIDIA-Nemotron-3-Super-120B-A12B-BF16-REAP-50pct-draft, a REAP expert-pruned checkpoint derived from nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16. All credit for the base model goes to NVIDIA, and the REAP pruning work to 0xSero.
# NVIDIA Nemotron-H 120B REAP 50% — GGUF
GGUF quantizations of the REAP 50%-pruned Nemotron-H 120B model for use with llama.cpp and compatible tools.
## Available Quantizations
| File | Quant | Size | BPW |
|---|---|---|---|
| Nemotron-H-120B-REAP-50pct-BF16.gguf | BF16 | 128.6 GB | 16.01 |
| Nemotron-H-120B-REAP-50pct-Q8_0.gguf | Q8_0 | 68.4 GB | 8.52 |
| Nemotron-H-120B-REAP-50pct-Q6_K.gguf | Q6_K | 59.7 GB | 7.43 |
| Nemotron-H-120B-REAP-50pct-Q4_K_M.gguf | Q4_K_M | 45.4 GB | 5.65 |
## Model Details
| Property | Value |
|---|---|
| Architecture | NemotronH (hybrid Mamba + MoE + Attention) |
| Total Blocks | 88 (40 Mamba, 40 MoE, 8 Attention) |
| Original Parameters | ~120B (64B after 50% expert pruning) |
| Experts per MoE Layer | 256 (pruned from 512) |
| Routed Experts per Token | 22 |
| Context Length | 262,144 tokens |
| Vocab Size | 131,072 |
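As a rough sanity check, each file size above is consistent with the ~64B post-pruning parameter count times the listed bits per weight (BPW). A minimal sketch of that arithmetic (the small remaining gap is GGUF metadata and per-block quantization overhead):

```python
# Rough size check: file size ≈ parameters × bits-per-weight / 8.
# Assumes ~64B effective parameters after 50% expert pruning (from the table above).
params = 64e9

for quant, bpw, listed_gb in [
    ("BF16", 16.01, 128.6),
    ("Q8_0", 8.52, 68.4),
    ("Q6_K", 7.43, 59.7),
    ("Q4_K_M", 5.65, 45.4),
]:
    est_gb = params * bpw / 8 / 1e9
    print(f"{quant}: ~{est_gb:.1f} GB estimated (listed {listed_gb} GB)")
```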
## Usage
```bash
# With llama.cpp
llama-cli -m Nemotron-H-120B-REAP-50pct-Q4_K_M.gguf -p "Hello" -n 128

# With ollama (create a Modelfile first)
ollama create nemotron-h-reap -f Modelfile
```
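For the ollama step, a minimal Modelfile sketch (assumes the Q4_K_M GGUF sits in the current directory; the context-size parameter is an illustrative choice, not a recommendation):

```
FROM ./Nemotron-H-120B-REAP-50pct-Q4_K_M.gguf
PARAMETER num_ctx 8192
```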
## About REAP Pruning
This model was pruned using the REAP method (arXiv:2510.13999), which removes 50% of the MoE experts in each layer based on observed per-expert activation statistics. This cuts the memory footprint roughly in half while preserving the most commonly activated expert pathways.
- Source: 0xSero/reap-expert-swap
- Research funding: donate.sybilsolutions.ai
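The core idea can be illustrated with a toy sketch: rank experts by an activation-based score collected on calibration data and keep the top half. This is not the actual REAP implementation (see arXiv:2510.13999 and the linked repository for the real saliency criterion); sizes and scores here are made up for illustration.

```python
import numpy as np

# Toy expert pruning by observed activation frequency (illustrative only;
# REAP uses a more refined saliency measure). The real model has 512 experts
# per MoE layer; we use 8 here to keep the example readable.
rng = np.random.default_rng(0)
num_experts = 8
activation_counts = rng.integers(1, 1000, size=num_experts)  # stand-in for calibration stats

# Keep the 50% most frequently routed experts, drop the rest.
keep = np.argsort(activation_counts)[-num_experts // 2:]
kept_mask = np.zeros(num_experts, dtype=bool)
kept_mask[keep] = True

print("activation counts:", activation_counts.tolist())
print("kept experts:", sorted(keep.tolist()))
```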
## Draft Caveats
This is a draft checkpoint derived from the original author's pruning work. Full serving benchmarks and quality evaluations have not been completed; evaluate accordingly before production use.
## License
Distributed under the NVIDIA Open Model License. See the original model for full terms.
## Quantized by

DJLougen, using llama.cpp on a DGX Spark.