Nemotron-3-Super-120B · NF4 4-bit

4-bit NF4 quantisation of nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16

Quantised end-to-end by NEO — an autonomous AI agent · March 2026

What is this?

This is a 4-bit NF4 quantisation of NVIDIA's Nemotron-3-Super-120B — a 120B parameter hybrid Mamba-2 + Sparse MoE model with a 262K token context window. The full BF16 model weighs 232 GB. This quantisation brings it down to 61.9 GB (3.7× compression) while maintaining near-lossless weight fidelity (codec SNR: 74.79 dB).

The entire quantisation — pipeline design, codec implementation, benchmarking, and this model card — was done autonomously by NEO.

Quantised by NEO

NEO is an autonomous AI agent that executes complex engineering tasks end-to-end — no hand-holding, no step-by-step instructions.

For this model, NEO was given a single goal: quantise Nemotron-3-Super-120B to NF4 4-bit and ship it to HuggingFace. Here's what NEO did, fully autonomously:

Step	What NEO did
1. Architecture	Designed the shard-by-shard streaming pipeline to keep peak disk usage under 6 GB
2. Codec	Implemented the NF4 pack/unpack codec from scratch (block_size=64, per-block absmax scaling)
3. Execution	Processed all 50 shards × 42,683 tensors in a single unattended run
4. Benchmark	Ran full quality validation — 0 errors, 0 NaN, 0 Inf, SNR 74.79 dB across all shards
5. Export	Generated model card, assembled repo structure, pushed to HuggingFace

→ heyneo.so

Model Details

Property	Value
Base model	`nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16`
Architecture	Hybrid Mamba-2 + Sparse MoE (`nemotron_h`)
Parameters	~120B (512 experts, 22 active per token)
Context length	262,144 tokens
Source dtype	BF16 — 232 GB
Quantised dtype	NF4 4-bit — 61.9 GB
Compression	3.7×
Shards	50 × safetensors (~1.2 GB each)
Block size	64 elements
Quantised by	NEO

Benchmark Results

All 50 shards benchmarked · 42,683 tensors · 826s · pipeline by NEO

Metric	Value	Threshold	Status
Load errors	0	0	✅
NaN tensors	0	0	✅
Inf tensors	0	0	✅
Codec SNR (mean)	74.79 dB	> 30 dB	✅ Excellent
Codec SNR (min)	74.64 dB	> 30 dB	✅
Mean weight norm	28.2240	—	—
Mean sparsity	11.60%	—	—

Codec SNR measures the round-trip fidelity of the NF4 pack/unpack codec. Above 40 dB = correctly implemented. Above 60 dB = near-lossless. At 74.79 dB, this quantisation sits comfortably in the near-lossless range. Note: codec SNR is not a substitute for task accuracy benchmarks (MMLU, HellaSwag, ARC).

Quick Start

Install

pip install transformers>=4.40 accelerate>=0.27 safetensors torch

Standard Inference (Multi-GPU / Large VRAM)

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "daksh-neo/Nemotron-3-Super-120B-NF4-4bit",
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "daksh-neo/Nemotron-3-Super-120B-NF4-4bit",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Explain NF4 quantisation in one paragraph."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=256, temperature=0.7, top_p=0.95)

print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Low-VRAM Inference (≥ 6 GB GPU + ≥ 64 GB RAM)

NEO's streaming loader enables inference on consumer hardware via layer-by-layer CPU offloading:

from src.loader import load_nf4_model
from transformers import AutoTokenizer
import torch

model = load_nf4_model(
    model_dir="path/to/quantized_nf4_4bit",
    target_dtype=torch.float16,
    max_gpu_gb=6.0,                # cap GPU usage — adjust to your device
    offload_folder="/tmp/offload", # disk offload when RAM is tight
)
tokenizer = AutoTokenizer.from_pretrained(
    "path/to/quantized_nf4_4bit",
    trust_remote_code=True,
)

Hardware Requirements

Scenario	GPU VRAM	CPU RAM	Example hardware
Full GPU	64+ GB	32+ GB	2× A6000, 1× H100
Mixed GPU + CPU	8+ GB	64+ GB	RTX 3090 + 64 GB DDR5
CPU only	—	128+ GB	Slow but functional

The 50-shard format means only a few shards (~2–3 GB) need to be resident in memory at a time.

Weight Format

This model uses a custom NF4 layout designed by NEO — not compatible with standard bitsandbytes loading. Each quantised weight tensor W is stored as three entries:

Key	Dtype	Description
`W.nf4_packed`	uint8	Two 4-bit NF4 indices packed per byte
`W.nf4_scales`	float16	Per-block absmax scale factors (block_size=64)
`W.nf4_shape`	int64	Original tensor dimensions

Non-quantised tensors (layer norms, biases, SSM parameters) remain at BF16.

Load with src/loader.py from nemotron-quantkit:

from src.loader import NF4ShardLoader

loader = NF4ShardLoader("path/to/quantized_nf4_4bit", target_dtype=torch.float16)
for name, tensor in loader.iter_all_tensors():
    print(name, tensor.shape, tensor.dtype)

Quantisation Pipeline

BF16 source · HuggingFace Hub · 232 GB
        │
        ▼  Stream shard 1 of 50  (~4.7 GB)
        │  NF4-quantise tensor-by-tensor   [block_size=64, CPU]
        │  Write quantised shard           (~1.2 GB)
        │  Evict source shard from cache
        ▼  Repeat × 50 shards
        │
        ▼  NF4 output · 61.9 GB · 3.7× compression

Peak disk during quantisation: ~6 GB
No GPU required for quantisation
Zero human intervention — designed and run by NEO

Limitations

Codec SNR ≠ task accuracy. MMLU / HellaSwag / ARC benchmarks require ≥ 93 GB combined GPU+RAM and have not been run on this quantisation.
Custom weight format. Standard AutoModel.from_pretrained will not load weights without the src/loader.py adapter from nemotron-quantkit.
modeling_nemotron_h.py required. Fetched automatically with trust_remote_code=True, or copy manually from the base model repo.
MoE routing at FP16. Expert routing logits are not quantised.

Citation

@misc{nemotron-nf4-neo-2026,
  title  = {Nemotron-3-Super-120B NF4 4-bit},
  author = {NEO, heyneo.so},
  year   = {2026},
  url    = {https://huggingface.co/daksh-neo/Nemotron-3-Super-120B-NF4-4bit}
}

@inproceedings{dettmers2023qlora,
  title     = {QLoRA: Efficient Finetuning of Quantized LLMs},
  author    = {Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke},
  booktitle = {NeurIPS},
  year      = {2023}
}

License

Quantised weights inherit the source model licence: NVIDIA Open Model License. Quantisation code (nemotron-quantkit) is MIT licensed.

Quantised autonomously by NEO

NEO is an autonomous AI agent. Give it a goal. It ships.

Downloads last month: 346

Safetensors

Model size

126B params

Tensor type

I64

BF16

F16

Model tree for daksh-neo/Nemotron-3-Super-120B-NF4-4bit

Base model

nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16

Quantized

(42)

this model