# Hermes-4-405B-NVFP4

NVFP4-quantized version of NousResearch/Hermes-4-405B, produced by Enfuse.

## Model Overview
| Attribute | Value |
|---|---|
| Base Model | NousResearch/Hermes-4-405B |
| Parameters | 405B (dense) |
| Architecture | Dense Transformer (Llama 3.1 based) |
| Quantization | NVFP4 (W4A4 with FP4 weights and dynamic FP4 activations) |
| Format | compressed-tensors (safetensors) |
| Precision | FP4 weights (group_size=16), FP8 scales, lm_head unquantized |
| Approx. Size | ~219 GB (down from ~812 GB in BF16) |
| Context Length | 131,072 tokens |
| License | Llama 3 Community License |
## Why This Model

Hermes-4-405B is one of the largest open-weight dense models available. At ~812 GB in BF16, it requires specialized hardware just to load into memory. This NVFP4 quantization reduces it to ~219 GB, making it deployable on 2-4 B200 GPUs rather than the 6+ GPUs the original requires.
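The size reduction follows directly from the bit widths. A back-of-envelope sketch (it ignores the unquantized `lm_head`, embeddings, and file metadata, so it only approximates the actual ~812 GB and ~219 GB checkpoints):

```python
# Approximate checkpoint sizes for a 405B-parameter dense model.
# Ignores the unquantized lm_head, embeddings, and metadata, so the
# figures only approximate the real checkpoints.

PARAMS = 405e9

bf16_gb = PARAMS * 2 / 1e9          # BF16: 2 bytes per weight
fp4_weight_gb = PARAMS * 0.5 / 1e9  # FP4: 4 bits = 0.5 bytes per weight
scale_gb = (PARAMS / 16) / 1e9      # one FP8 scale per group of 16 weights

print(f"BF16:        ~{bf16_gb:.0f} GB")
print(f"FP4 weights: ~{fp4_weight_gb:.1f} GB")
print(f"FP8 scales:  ~{scale_gb:.1f} GB")
print(f"NVFP4 total: ~{fp4_weight_gb + scale_gb:.0f} GB")
```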
The base Llama-3.1-405B-Instruct already has an NVFP4 quantization from NVIDIA, but Hermes-4-405B is a distinct model: it adds ~60B tokens of post-training data, hybrid reasoning capabilities, and different alignment characteristics.
## How to Use

### vLLM (recommended)
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "enfuse/Hermes-4-405B-NVFP4"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Requires NVFP4-capable GPUs; see Hardware Requirements below.
llm = LLM(model=model_id, tensor_parallel_size=2)
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)

messages = [
    {"role": "user", "content": "Explain the architecture of a 405B parameter language model."},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
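For serving over an OpenAI-compatible API instead of offline inference, vLLM's CLI can launch the same model (a sketch; adjust tensor parallelism and context length to your hardware):

```shell
# Launch an OpenAI-compatible server across 2 GPUs (adjust to your setup)
vllm serve enfuse/Hermes-4-405B-NVFP4 \
  --tensor-parallel-size 2 \
  --max-model-len 131072
```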
## Hardware Requirements
- Full NVFP4 (W4A4): Requires NVIDIA Blackwell GPU (B200, GB200, RTX 5090)
- Weight-only FP4: Older GPUs (H100, A100) can load the model but will only apply weight quantization
- Recommended: 2x B200 with tensor parallelism (~219 GB fits across 2x 183 GB GPUs; 4x B200 recommended for KV-cache headroom)
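The KV-cache headroom point can be made concrete. Assuming the public Llama 3.1 405B architecture (126 layers, 8 KV heads via GQA, head dimension 128 — figures from the base model's config, not from this repository), a rough sketch of cache growth per sequence:

```python
# Approximate KV-cache footprint, assuming the public Llama 3.1 405B
# config: 126 layers, 8 KV heads (GQA), head_dim 128.
LAYERS, KV_HEADS, HEAD_DIM = 126, 8, 128
BYTES = 2  # BF16 cache; an FP8 KV cache would halve this

per_token = LAYERS * 2 * KV_HEADS * HEAD_DIM * BYTES  # K and V per layer
full_context_gb = per_token * 131_072 / 1e9           # one 131K-token sequence

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"One full-context sequence: ~{full_context_gb:.0f} GB")
```

At roughly 0.5 MiB per token, a single full-context sequence consumes on the order of 68 GB on top of the ~219 GB of weights, which is why 4x B200 is the comfortable configuration.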
## Quantization Details

Quantized using LLM Compressor (v0.10.0):
- Method: Post-training quantization (PTQ) with calibration
- Calibration data: 512 samples from HuggingFaceH4/ultrachat_200k
- Sequence length: 2048 tokens
- Scheme: NVFP4
- Excluded layers: `lm_head`
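The settings above correspond roughly to an LLM Compressor one-shot run along these lines (a sketch reconstructed from the listed parameters, not the exact script used; dataset preprocessing is elided, and the run itself needs enough GPU memory to hold the BF16 model):

```python
# Sketch of a one-shot NVFP4 PTQ run with LLM Compressor,
# reconstructed from the settings listed above.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",      # FP4 weights (group_size=16) with FP8 scales
    ignore=["lm_head"],  # keep the output head unquantized
)

oneshot(
    model="NousResearch/Hermes-4-405B",
    dataset="ultrachat_200k",       # HuggingFaceH4/ultrachat_200k calibration set
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```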
## Infrastructure
Quantized on an NVIDIA DGX B200 (8x B200, 2 TiB RAM, CUDA 13.0). The 812 GB BF16 model required 6+ B200 GPUs to load, making this quantization infeasible on most hardware configurations.
## Evaluation

### OpenLLM v1 Benchmarks

Evaluated using lm-evaluation-harness (v0.4.11) with the vLLM backend (`--apply_chat_template --fewshot_as_multiturn`, `tensor_parallel_size=8`) on NVIDIA B200 GPUs.
| Benchmark | Metric | n-shot | NVFP4 | BF16 Reference¹ | Recovery |
|---|---|---|---|---|---|
| ARC-Challenge | acc_norm | 25 | 75.17 | — | — |
| GSM8K | exact_match | 5 | 92.49 | — | — |
| HellaSwag | acc_norm | 10 | 88.49 | — | — |
| MMLU | acc | 5 | 83.19 | 87.2 | 95.4% |
| TruthfulQA MC2 | acc | 0 | 65.13 | — | — |
| Winogrande | acc | 5 | 81.22 | — | — |
¹ Hermes 4 Technical Report (arXiv:2508.18255v2) reports scores using lighteval with non-standard sampling (temp=0.6, top_p=0.95, top_k=20). Our evals use lm_eval with default settings. Direct comparison requires identical configurations; only MMLU is shown where a comparable reference exists.
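The evaluation setup described above corresponds to an lm-evaluation-harness invocation along these lines (a reconstruction from the listed settings, not the exact command; MMLU shown as the example task):

```shell
lm_eval --model vllm \
  --model_args pretrained=enfuse/Hermes-4-405B-NVFP4,tensor_parallel_size=8 \
  --tasks mmlu \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```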
### Hermes-4-405B BF16 Reference Scores (Reasoning Mode)
Reference scores from the Hermes 4 Technical Report (Table 3, different eval methodology):
| Benchmark | BF16 Score (Reasoning) | BF16 Score (Non-reasoning) |
|---|---|---|
| MMLU | 87.2 | 73.6 |
| MMLU-Pro | 80.6 | 58.3 |
| GPQA Diamond | 70.6 | 39.4 |
| BBH | 86.3 | 68.7 |
| IFEval | 81.5 | 84.9 |
| DROP | 83.5 | 77.6 |
| OpenBookQA | 94.2 | 84.4 |
## About Enfuse
Enfuse builds sovereign AI infrastructure for regulated enterprises. The Enfuse platform provides on-prem LLM orchestration and an App Factory for shipping governed, compliant AI applications on your own infrastructure.
This quantization is part of our ongoing work to make large language models more accessible and efficient for on-premise deployment, where memory efficiency directly impacts what models organizations can run within their own data centers.
## Acknowledgments
- NousResearch for the Hermes-4-405B model
- vLLM Project for LLM Compressor
- NVIDIA for the NVFP4 format and Blackwell hardware