Nemotron-3-Super-120B · NF4 4-bit
4-bit NF4 quantisation of nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
Quantised end-to-end by NEO — an autonomous AI agent · March 2026
What is this?
This is a 4-bit NF4 quantisation of NVIDIA's Nemotron-3-Super-120B — a 120B parameter hybrid Mamba-2 + Sparse MoE model with a 262K token context window. The full BF16 model weighs 232 GB. This quantisation brings it down to 61.9 GB (3.7× compression) while maintaining near-lossless weight fidelity (codec SNR: 74.79 dB).
The entire quantisation — pipeline design, codec implementation, benchmarking, and this model card — was done autonomously by NEO.
Quantised by NEO
NEO is an autonomous AI agent that executes complex engineering tasks end-to-end — no hand-holding, no step-by-step instructions.
For this model, NEO was given a single goal: quantise Nemotron-3-Super-120B to NF4 4-bit and ship it to HuggingFace. Here's what NEO did, fully autonomously:
| Step | What NEO did |
|---|---|
| 1. Architecture | Designed the shard-by-shard streaming pipeline to keep peak disk usage under 6 GB |
| 2. Codec | Implemented the NF4 pack/unpack codec from scratch (block_size=64, per-block absmax scaling) |
| 3. Execution | Processed all 50 shards × 42,683 tensors in a single unattended run |
| 4. Benchmark | Ran full quality validation — 0 errors, 0 NaN, 0 Inf, SNR 74.79 dB across all shards |
| 5. Export | Generated model card, assembled repo structure, pushed to HuggingFace |
Model Details
| Property | Value |
|---|---|
| Base model | nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 |
| Architecture | Hybrid Mamba-2 + Sparse MoE (nemotron_h) |
| Parameters | ~120B (512 experts, 22 active per token) |
| Context length | 262,144 tokens |
| Source dtype | BF16 — 232 GB |
| Quantised dtype | NF4 4-bit — 61.9 GB |
| Compression | 3.7× |
| Shards | 50 × safetensors (~1.2 GB each) |
| Block size | 64 elements |
| Quantised by | NEO |
Benchmark Results
All 50 shards benchmarked · 42,683 tensors · 826s · pipeline by NEO
| Metric | Value | Threshold | Status |
|---|---|---|---|
| Load errors | 0 | 0 | ✅ |
| NaN tensors | 0 | 0 | ✅ |
| Inf tensors | 0 | 0 | ✅ |
| Codec SNR (mean) | 74.79 dB | > 30 dB | ✅ Excellent |
| Codec SNR (min) | 74.64 dB | > 30 dB | ✅ |
| Mean weight norm | 28.2240 | — | — |
| Mean sparsity | 11.60% | — | — |
Codec SNR measures the round-trip fidelity of the NF4 pack/unpack codec. Above 40 dB = correctly implemented. Above 60 dB = near-lossless. At 74.79 dB, this quantisation sits comfortably in the near-lossless range. Note: codec SNR is not a substitute for task accuracy benchmarks (MMLU, HellaSwag, ARC).
Quick Start
Install
pip install transformers>=4.40 accelerate>=0.27 safetensors torch
Standard Inference (Multi-GPU / Large VRAM)
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
"daksh-neo/Nemotron-3-Super-120B-NF4-4bit",
device_map="auto",
torch_dtype=torch.float16,
trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
"daksh-neo/Nemotron-3-Super-120B-NF4-4bit",
trust_remote_code=True,
)
messages = [{"role": "user", "content": "Explain NF4 quantisation in one paragraph."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
out = model.generate(**inputs, max_new_tokens=256, temperature=0.7, top_p=0.95)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
Low-VRAM Inference (≥ 6 GB GPU + ≥ 64 GB RAM)
NEO's streaming loader enables inference on consumer hardware via layer-by-layer CPU offloading:
from src.loader import load_nf4_model
from transformers import AutoTokenizer
import torch
model = load_nf4_model(
model_dir="path/to/quantized_nf4_4bit",
target_dtype=torch.float16,
max_gpu_gb=6.0, # cap GPU usage — adjust to your device
offload_folder="/tmp/offload", # disk offload when RAM is tight
)
tokenizer = AutoTokenizer.from_pretrained(
"path/to/quantized_nf4_4bit",
trust_remote_code=True,
)
Hardware Requirements
| Scenario | GPU VRAM | CPU RAM | Example hardware |
|---|---|---|---|
| Full GPU | 64+ GB | 32+ GB | 2× A6000, 1× H100 |
| Mixed GPU + CPU | 8+ GB | 64+ GB | RTX 3090 + 64 GB DDR5 |
| CPU only | — | 128+ GB | Slow but functional |
The 50-shard format means only a few shards (~2–3 GB) need to be resident in memory at a time.
Weight Format
This model uses a custom NF4 layout designed by NEO — not compatible with standard bitsandbytes loading. Each quantised weight tensor W is stored as three entries:
| Key | Dtype | Description |
|---|---|---|
W.nf4_packed |
uint8 | Two 4-bit NF4 indices packed per byte |
W.nf4_scales |
float16 | Per-block absmax scale factors (block_size=64) |
W.nf4_shape |
int64 | Original tensor dimensions |
Non-quantised tensors (layer norms, biases, SSM parameters) remain at BF16.
Load with src/loader.py from nemotron-quantkit:
from src.loader import NF4ShardLoader
loader = NF4ShardLoader("path/to/quantized_nf4_4bit", target_dtype=torch.float16)
for name, tensor in loader.iter_all_tensors():
print(name, tensor.shape, tensor.dtype)
Quantisation Pipeline
BF16 source · HuggingFace Hub · 232 GB
│
▼ Stream shard 1 of 50 (~4.7 GB)
│ NF4-quantise tensor-by-tensor [block_size=64, CPU]
│ Write quantised shard (~1.2 GB)
│ Evict source shard from cache
▼ Repeat × 50 shards
│
▼ NF4 output · 61.9 GB · 3.7× compression
- Peak disk during quantisation: ~6 GB
- No GPU required for quantisation
- Zero human intervention — designed and run by NEO
Limitations
- Codec SNR ≠ task accuracy. MMLU / HellaSwag / ARC benchmarks require ≥ 93 GB combined GPU+RAM and have not been run on this quantisation.
- Custom weight format. Standard
AutoModel.from_pretrainedwill not load weights without thesrc/loader.pyadapter from nemotron-quantkit. modeling_nemotron_h.pyrequired. Fetched automatically withtrust_remote_code=True, or copy manually from the base model repo.- MoE routing at FP16. Expert routing logits are not quantised.
Citation
@misc{nemotron-nf4-neo-2026,
title = {Nemotron-3-Super-120B NF4 4-bit},
author = {NEO, heyneo.so},
year = {2026},
url = {https://huggingface.co/daksh-neo/Nemotron-3-Super-120B-NF4-4bit}
}
@inproceedings{dettmers2023qlora,
title = {QLoRA: Efficient Finetuning of Quantized LLMs},
author = {Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke},
booktitle = {NeurIPS},
year = {2023}
}
License
Quantised weights inherit the source model licence: NVIDIA Open Model License. Quantisation code (nemotron-quantkit) is MIT licensed.
Quantised autonomously by NEO
NEO is an autonomous AI agent. Give it a goal. It ships.
- Downloads last month
- 346
