# Hermes-4-405B-NVFP4

NVFP4-quantized version of NousResearch/Hermes-4-405B, produced by Enfuse.

## Model Overview

| Attribute | Value |
|---|---|
| Base Model | NousResearch/Hermes-4-405B |
| Parameters | 405B (dense) |
| Architecture | Dense Transformer (Llama 3.1 based) |
| Quantization | NVFP4 (W4A4: FP4 weights and dynamic FP4 activations) |
| Format | compressed-tensors (safetensors) |
| Precision | FP4 weights (group_size=16), FP8 scales, lm_head unquantized |
| Approx. Size | ~219 GB (down from ~812 GB in BF16) |
| Context Length | 131,072 tokens |
| License | Llama 3 Community License |

## Why This Model

Hermes-4-405B is one of the largest open-weight dense models available. At 812 GB in BF16, it requires specialized hardware just to load into memory. This NVFP4 quantization reduces it to ~219 GB, making it deployable on 2-4 B200 GPUs instead of the 6+ needed for the original.

The base Llama-3.1-405B-Instruct already has an NVFP4 quantization from NVIDIA, but Hermes-4-405B is a distinct model with ~60B tokens of post-training data, hybrid reasoning capabilities, and different alignment characteristics.

## How to Use

### vLLM (recommended)

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "enfuse/Hermes-4-405B-NVFP4"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Two-way tensor parallelism; see Hardware Requirements below.
llm = LLM(model=model_id, tensor_parallel_size=2)
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)

messages = [
    {"role": "user", "content": "Explain the architecture of a 405B parameter language model."},
]

prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
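For serving behind an OpenAI-compatible API rather than in-process, vLLM's CLI can launch the same checkpoint; this is a sketch with the flags mirroring the Python example (adjust `--tensor-parallel-size` to your GPU count):

```shell
vllm serve enfuse/Hermes-4-405B-NVFP4 \
  --tensor-parallel-size 2 \
  --max-model-len 131072
```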

## Hardware Requirements

- **Full NVFP4 (W4A4):** requires an NVIDIA Blackwell GPU (B200, GB200, RTX 5090)
- **Weight-only FP4:** older GPUs (H100, A100) can load the model but will apply only weight quantization
- **Recommended:** 2x B200 with tensor parallelism (219 GB fits across 2x 183 GB GPUs, though 4x B200 is recommended for KV-cache headroom)
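The size figures follow from simple arithmetic. A rough sketch (assumptions: 405e9 parameters, 4-bit weights, one 1-byte FP8 scale per group of 16 weights; the unquantized lm_head and per-tensor global scales are ignored, which is why this lands slightly above the shipped ~219 GB):

```python
# Back-of-the-envelope memory estimate for BF16 vs. NVFP4.
params = 405e9

bf16_gb = params * 2.0 / 1e9            # 2 bytes per weight in BF16
weights_gb = params * 0.5 / 1e9         # 4 bits = 0.5 bytes per weight
scales_gb = params / 16 * 1.0 / 1e9     # one FP8 (1-byte) scale per 16-weight group

print(f"BF16:  ~{bf16_gb:.0f} GB")               # ~810 GB, matching the reported ~812 GB
print(f"NVFP4: ~{weights_gb + scales_gb:.0f} GB")  # ~228 GB, in the ballpark of ~219 GB
```

The estimate confirms the roughly 3.7x reduction that makes a 2-4 GPU deployment possible.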

## Quantization Details

Quantized using LLM Compressor (v0.10.0):

- **Method:** post-training quantization (PTQ) with calibration
- **Calibration data:** 512 samples from HuggingFaceH4/ultrachat_200k
- **Sequence length:** 2048 tokens
- **Scheme:** NVFP4
- **Excluded layers:** lm_head
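The settings above correspond to a short LLM Compressor recipe. A sketch of what such a recipe could look like (the modifier layout and `NVFP4` scheme name follow llm-compressor conventions; treat the exact fields as assumptions, not the recipe actually used here):

```yaml
# Sketch of an NVFP4 PTQ recipe for LLM Compressor (assumed field layout)
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      targets: ["Linear"]   # quantize all Linear layers
      scheme: "NVFP4"       # FP4 weights + dynamic FP4 activations, FP8 scales
      ignore: ["lm_head"]   # keep the output head in higher precision
```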

## Infrastructure

Quantized on an NVIDIA DGX B200 (8x B200, 2 TiB RAM, CUDA 13.0). The 812 GB BF16 model requires 6+ B200 GPUs just to load, which puts producing this quantization out of reach on most hardware configurations.

## Evaluation

### OpenLLM v1 Benchmarks

Evaluated with lm-evaluation-harness (v0.4.11) using the vLLM backend, `--apply_chat_template --fewshot_as_multiturn`, and `tensor_parallel_size=8` on NVIDIA B200 GPUs.

| Benchmark | Metric | n-shot | NVFP4 | BF16 Reference¹ | Recovery |
|---|---|---|---|---|---|
| ARC-Challenge | acc_norm | 25 | 75.17 | -- | -- |
| GSM8K | exact_match | 5 | 92.49 | -- | -- |
| HellaSwag | acc_norm | 10 | 88.49 | -- | -- |
| MMLU | acc | 5 | 83.19 | 87.2 | 95.4% |
| TruthfulQA MC2 | acc | 0 | 65.13 | -- | -- |
| Winogrande | acc | 5 | 81.22 | -- | -- |

¹ The Hermes 4 Technical Report (arXiv:2508.18255v2) reports scores using lighteval with non-standard sampling (temp=0.6, top_p=0.95, top_k=20), while our evals use lm_eval with default settings. Direct comparison requires identical configurations, so a BF16 reference is shown only for MMLU, where a roughly comparable number exists.
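For transparency, "Recovery" in the table above is simply the quantized score as a percentage of the BF16 reference score, computed only for MMLU:

```python
# Recovery = NVFP4 score / BF16 reference score, expressed as a percentage.
nvfp4_mmlu = 83.19
bf16_mmlu = 87.2

recovery = nvfp4_mmlu / bf16_mmlu * 100
print(f"{recovery:.1f}%")  # 95.4%
```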

### Hermes-4-405B BF16 Reference Scores (Reasoning Mode)

Reference scores from the Hermes 4 Technical Report (Table 3, different eval methodology):

| Benchmark | BF16 Score (Reasoning) | BF16 Score (Non-reasoning) |
|---|---|---|
| MMLU | 87.2 | 73.6 |
| MMLU-Pro | 80.6 | 58.3 |
| GPQA Diamond | 70.6 | 39.4 |
| BBH | 86.3 | 68.7 |
| IFEval | 81.5 | 84.9 |
| DROP | 83.5 | 77.6 |
| OpenBookQA | 94.2 | 84.4 |

## About Enfuse

Enfuse builds sovereign AI infrastructure for regulated enterprises. The Enfuse platform provides on-prem LLM orchestration and an App Factory for shipping governed, compliant AI applications on your own infrastructure.

This quantization is part of our ongoing work to make large language models more accessible and efficient for on-premise deployment, where memory efficiency directly impacts what models organizations can run within their own data centers.
