# Hermes-4-405B-NVFP4

NVFP4-quantized version of NousResearch/Hermes-4-405B, produced by Enfuse.

## Model Overview
| Attribute | Value |
|---|---|
| Base Model | NousResearch/Hermes-4-405B |
| Parameters | 405B (dense) |
| Architecture | Dense Transformer (Llama 3.1 based) |
| Quantization | NVFP4 (W4A4 with FP4 weights and dynamic FP4 activations) |
| Format | compressed-tensors (safetensors) |
| Precision | FP4 weights (group_size=16), FP8 scales, lm_head unquantized |
| Approx. Size | ~219 GB (down from ~812 GB in BF16) |
| Context Length | 131,072 tokens |
| License | Llama 3 Community License |
## Why This Model

Hermes-4-405B is one of the largest open-weight dense models available. At ~812 GB in BF16, it requires specialized hardware just to load into memory. This NVFP4 quantization reduces it to ~219 GB, making it deployable on 2-4 B200 GPUs rather than the 6+ GPUs the original requires.
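The size reduction follows directly from the bit widths. A back-of-envelope sketch (it ignores the unquantized `lm_head`, embeddings, and file metadata, so it only approximates the actual ~812 GB and ~219 GB checkpoints):

```python
# Approximate checkpoint sizes for a 405B-parameter dense model.
# Ignores the unquantized lm_head, embeddings, and metadata, so the
# figures only approximate the real checkpoints.

PARAMS = 405e9

bf16_gb = PARAMS * 2 / 1e9          # BF16: 2 bytes per weight
fp4_weight_gb = PARAMS * 0.5 / 1e9  # FP4: 4 bits = 0.5 bytes per weight
scale_gb = (PARAMS / 16) / 1e9      # one FP8 scale per group of 16 weights

print(f"BF16:        ~{bf16_gb:.0f} GB")
print(f"FP4 weights: ~{fp4_weight_gb:.1f} GB")
print(f"FP8 scales:  ~{scale_gb:.1f} GB")
print(f"NVFP4 total: ~{fp4_weight_gb + scale_gb:.0f} GB")
```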
The base Llama-3.1-405B-Instruct already has an NVFP4 quantization from NVIDIA, but Hermes-4-405B is a distinct model: it adds ~60B tokens of post-training data, hybrid reasoning capabilities, and different alignment characteristics.
## How to Use

### vLLM (recommended)
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "enfuse/Hermes-4-405B-NVFP4"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Requires NVFP4-capable GPUs; see Hardware Requirements below.
llm = LLM(model=model_id, tensor_parallel_size=2)
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)

messages = [
    {"role": "user", "content": "Explain the architecture of a 405B parameter language model."},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
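For serving over an OpenAI-compatible API instead of offline inference, vLLM's CLI can launch the same model (a sketch; adjust tensor parallelism and context length to your hardware):

```shell
# Launch an OpenAI-compatible server across 2 GPUs (adjust to your setup)
vllm serve enfuse/Hermes-4-405B-NVFP4 \
  --tensor-parallel-size 2 \
  --max-model-len 131072
```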
## Hardware Requirements
- Full NVFP4 (W4A4): Requires NVIDIA Blackwell GPU (B200, GB200, RTX 5090)
- Weight-only FP4: Older GPUs (H100, A100) can load the model but will only apply weight quantization
- Recommended: 2x B200 with tensor parallelism (~219 GB fits across 2x 183 GB GPUs; 4x B200 recommended for KV-cache headroom)
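The KV-cache headroom point can be made concrete. Assuming the public Llama 3.1 405B architecture (126 layers, 8 KV heads via GQA, head dimension 128 — figures from the base model's config, not from this repository), a rough sketch of cache growth per sequence:

```python
# Approximate KV-cache footprint, assuming the public Llama 3.1 405B
# config: 126 layers, 8 KV heads (GQA), head_dim 128.
LAYERS, KV_HEADS, HEAD_DIM = 126, 8, 128
BYTES = 2  # BF16 cache; an FP8 KV cache would halve this

per_token = LAYERS * 2 * KV_HEADS * HEAD_DIM * BYTES  # K and V per layer
full_context_gb = per_token * 131_072 / 1e9           # one 131K-token sequence

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"One full-context sequence: ~{full_context_gb:.0f} GB")
```

At roughly 0.5 MiB per token, a single full-context sequence consumes on the order of 68 GB on top of the ~219 GB of weights, which is why 4x B200 is the comfortable configuration.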
## Quantization Details

Quantized using LLM Compressor (v0.10.0):
- Method: Post-training quantization (PTQ) with calibration
- Calibration data: 512 samples from HuggingFaceH4/ultrachat_200k
- Sequence length: 2048 tokens
- Scheme: NVFP4
- Excluded layers: `lm_head`
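The settings above correspond roughly to an LLM Compressor one-shot run along these lines (a sketch reconstructed from the listed parameters, not the exact script used; dataset preprocessing is elided, and the run itself needs enough GPU memory to hold the BF16 model):

```python
# Sketch of a one-shot NVFP4 PTQ run with LLM Compressor,
# reconstructed from the settings listed above.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",      # FP4 weights (group_size=16) with FP8 scales
    ignore=["lm_head"],  # keep the output head unquantized
)

oneshot(
    model="NousResearch/Hermes-4-405B",
    dataset="ultrachat_200k",       # HuggingFaceH4/ultrachat_200k calibration set
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```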
## Infrastructure
Quantized on an NVIDIA DGX B200 (8x B200, 2 TiB RAM, CUDA 13.0). The 812 GB BF16 model required 6+ B200 GPUs to load, making this quantization infeasible on most hardware configurations.
## Evaluation

### OpenLLM v1 Benchmarks

Evaluated using lm-evaluation-harness (v0.4.11) with the vLLM backend (`--apply_chat_template --fewshot_as_multiturn`, `tensor_parallel_size=8`) on NVIDIA B200 GPUs.
| Benchmark | Metric | n-shot | NVFP4 | BF16 Reference¹ | Recovery |
|---|---|---|---|---|---|
| ARC-Challenge | acc_norm | 25 | 75.17 | — | — |
| GSM8K | exact_match | 5 | 92.49 | — | — |
| HellaSwag | acc_norm | 10 | 88.49 | — | — |
| MMLU | acc | 5 | 83.19 | 87.2 | 95.4% |
| TruthfulQA MC2 | acc | 0 | 65.13 | — | — |
| Winogrande | acc | 5 | 81.22 | — | — |
¹ Hermes 4 Technical Report (arXiv:2508.18255v2) reports scores using lighteval with non-standard sampling (temp=0.6, top_p=0.95, top_k=20). Our evals use lm_eval with default settings. Direct comparison requires identical configurations; only MMLU is shown where a comparable reference exists.
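The evaluation setup described above corresponds to an lm-evaluation-harness invocation along these lines (a reconstruction from the listed settings, not the exact command; MMLU shown as the example task):

```shell
lm_eval --model vllm \
  --model_args pretrained=enfuse/Hermes-4-405B-NVFP4,tensor_parallel_size=8 \
  --tasks mmlu \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```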
### Hermes-4-405B BF16 Reference Scores (Reasoning Mode)
Reference scores from the Hermes 4 Technical Report (Table 3, different eval methodology):
| Benchmark | BF16 Score (Reasoning) | BF16 Score (Non-reasoning) |
|---|---|---|
| MMLU | 87.2 | 73.6 |
| MMLU-Pro | 80.6 | 58.3 |
| GPQA Diamond | 70.6 | 39.4 |
| BBH | 86.3 | 68.7 |
| IFEval | 81.5 | 84.9 |
| DROP | 83.5 | 77.6 |
| OpenBookQA | 94.2 | 84.4 |
## About Enfuse
Enfuse builds sovereign AI infrastructure for regulated enterprises. The Enfuse platform provides on-prem LLM orchestration and an App Factory for shipping governed, compliant AI applications on your own infrastructure.
This quantization is part of our ongoing work to make large language models more accessible and efficient for on-premise deployment, where memory efficiency directly impacts what models organizations can run within their own data centers.
## Acknowledgments
- NousResearch for the Hermes-4-405B model
- vLLM Project for LLM Compressor
- NVIDIA for the NVFP4 format and Blackwell hardware