Gemma 4 26B-A4B IT FP8 Dynamic Norouter
Production-ready offline FP8 checkpoint for vLLM — 47% less VRAM, 80% more concurrency vs BF16.
We searched for a usable offline FP8 checkpoint of Gemma 4 26B-A4B-it but couldn't find one that worked cleanly with vLLM. So we vibe-coded our own and are sharing it with the community.
This repository hosts an offline FP8 checkpoint derived from google/gemma-4-26B-A4B-it for vLLM serving. No on-the-fly quantization needed at startup.
Published by Largitdata Inc.
Note: This is a derived operational checkpoint, not an official Google release. The original model's license terms, safety guidance, and documentation remain authoritative.
Model Details
- Base model: `google/gemma-4-26B-A4B-it`
- Derived format: offline FP8 checkpoint for vLLM
- Quantization tool: `llmcompressor`
- Quantization method: `FP8_DYNAMIC`
- Calibration data: none required (dynamic quantization)
- Excluded weights:
  - norm-class 1D tensors, excluded to avoid "expected 2D linear weight" validation errors during quantization
  - `re:.*router\.proj$` (MoE router weights), excluded to maintain compatibility with the Gemma4 vLLM loading path
- Output directory name: `gemma-4-26B-A4B-it-FP8-DYNAMIC-NOROUTER`
- Primary serving target: `vllm/vllm-openai:gemma4`
- Organization: Largitdata Inc.
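The two exclusion rules above can be sketched as a simple eligibility filter. This is an illustrative reconstruction, not the exact llmcompressor configuration, and the module names below are hypothetical:

```python
import re

# MoE router projections are excluded via the pattern re:.*router\.proj$;
# norm-class 1D tensors are excluded because FP8 linear quantization
# expects 2D weight matrices.
ROUTER_PATTERN = re.compile(r".*router\.proj$")

def should_quantize(module_name: str, weight_ndim: int) -> bool:
    """Decide whether a weight is eligible for FP8_DYNAMIC quantization."""
    if weight_ndim != 2:                    # skip norm-class 1D tensors
        return False
    if ROUTER_PATTERN.match(module_name):   # skip MoE router weights
        return False
    return True

# Hypothetical module names, for illustration only:
print(should_quantize("model.layers.0.mlp.experts.3.down_proj", 2))  # True
print(should_quantize("model.layers.0.mlp.router.proj", 2))          # False
print(should_quantize("model.layers.0.input_layernorm", 1))          # False
```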
Test Environment
- GPU: NVIDIA H200 NVL (143 GB VRAM)
- Runtime: `vllm/vllm-openai:gemma4`
- KV cache dtype: `fp8`
- `max_model_len`: 32768
- `gpu_memory_utilization`: 0.55
Observed vLLM startup characteristics:
- model weight loading: 15.76 s
- model loading total: 16.88 s
- `torch.compile`: 56.98 s
- engine init: 102.17 s
- total time to `/v1/models` ready: about 153 s
Observed runtime capacity:
- `max_num_batched_tokens`: 8192
- available KV cache memory: 46.37 GiB
- GPU KV cache size: 405,184 tokens
- maximum concurrency at 32,768 tokens/request: 38.87x
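As a back-of-the-envelope sanity check (not an official figure), dividing the available KV cache memory by the reported token capacity gives the per-token KV footprint under the fp8 KV cache:

```python
# Available KV cache memory vs. reported token capacity, from the logs above.
kv_cache_bytes = 46.37 * 1024**3   # 46.37 GiB
kv_cache_tokens = 405_184

bytes_per_token = kv_cache_bytes / kv_cache_tokens
print(f"{bytes_per_token / 1024:.1f} KiB per token")  # ~120 KiB/token
```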
Serving Capacity Comparison
| Metric | FP8 Dynamic Norouter | BF16 Baseline |
|---|---|---|
| Model loading memory | 25.75 GiB | 48.5 GiB |
| GPU KV cache size | 405,184 tokens | 225,376 tokens |
| Max concurrency @ 32K tokens/req | 38.87x | 21.62x |
| VRAM savings | 47% less | — |
| KV cache gain | 80% more | — |
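The percentage rows follow from the raw numbers in the table; a minimal check, which also shows that the concurrency gain tracks the KV cache gain:

```python
# Figures from the comparison table above.
fp8_load_gib, bf16_load_gib = 25.75, 48.5
fp8_kv_tokens, bf16_kv_tokens = 405_184, 225_376
fp8_conc, bf16_conc = 38.87, 21.62

vram_savings = 1 - fp8_load_gib / bf16_load_gib   # ≈ 0.47
kv_gain = fp8_kv_tokens / bf16_kv_tokens - 1       # ≈ 0.80
print(f"VRAM savings: {vram_savings:.0%}, KV cache gain: {kv_gain:.0%}")

# Concurrency ratio vs. KV cache ratio: both ≈ 1.8x.
print(round(fp8_conc / bf16_conc, 2), round(fp8_kv_tokens / bf16_kv_tokens, 2))
```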
Basic Benchmark
Single-request warm benchmark against the OpenAI-compatible vLLM endpoint:
- prompt tokens: 38
- completion tokens: 256
- temperature: 0
| Metric | FP8 Dynamic Norouter | BF16 Baseline |
|---|---|---|
| Avg end-to-end latency | 1.629 s | 1.536 s |
| Avg completion throughput | 157.19 tok/s | 166.62 tok/s |
| Avg total throughput | 180.53 tok/s | 191.36 tok/s |
These numbers are single-request warm-path measurements, not multi-client throughput tests. In production multi-client scenarios, the FP8 variant's larger KV cache is expected to provide superior aggregate throughput.
BF16 is ~6% faster on single-request latency, but the FP8 variant uses 47% less VRAM and provides 80% more KV cache capacity. For production environments serving multiple concurrent users, the FP8 variant offers a better trade-off.
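The throughput rows are consistent with the token counts and latencies: completion throughput is completion tokens divided by end-to-end latency, and total throughput adds the prompt tokens. A quick check for the FP8 column (the small residual vs. the table is rounding across averaged runs):

```python
# Benchmark parameters and FP8 latency from the tables above.
prompt_tokens, completion_tokens = 38, 256
fp8_latency_s = 1.629

completion_tps = completion_tokens / fp8_latency_s                # ≈ 157 tok/s
total_tps = (prompt_tokens + completion_tokens) / fp8_latency_s   # ≈ 180 tok/s
print(round(completion_tps, 2), round(total_tps, 2))
```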
Accuracy Evaluation — MMLU-Pro
We ran MMLU-Pro (full set, 14 categories) comparing this FP8 checkpoint against its BF16 baseline on the same harness and prompt template.
| Model | MMLU-Pro Accuracy | Wall clock |
|---|---|---|
| 26B BF16 (baseline) | 81.59% | 65 min |
| 26B FP8 Dynamic Norouter (this) | 81.33% | 63 min |
| Δ (FP8 − BF16) | −0.26 pp | — |
FP8 quantization cost is essentially zero — the 0.26 pp gap sits inside benchmark noise. Our 26B BF16 number (81.59%) is ~1.0 pp below Google's reported 82.6%, consistent with harness/prompt differences rather than a quantization defect; both BF16 and FP8 shift together.
For context, we also evaluated the sibling 31B variant under identical conditions:
| Model | MMLU-Pro |
|---|---|
| 26B BF16 | 81.59% |
| 26B FP8 | 81.33% |
| 31B BF16 | 84.60% |
| 31B FP8 Dynamic | 84.55% |
See also: LargitData/gemma-4-31b-it-fp8.
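The quantization deltas for both model sizes, computed from the tables above:

```python
# Accuracy pairs (BF16, FP8) from the MMLU-Pro tables, in percent.
results = {"26B": (81.59, 81.33), "31B": (84.60, 84.55)}

for size, (bf16, fp8) in results.items():
    delta_pp = round(fp8 - bf16, 2)       # percentage-point delta
    print(f"{size}: {delta_pp:+.2f} pp")  # 26B: -0.26 pp, 31B: -0.05 pp
```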
Usage
Example vLLM launch (replace `/path/to/models` with the host directory that contains the checkpoint):
docker run -d \
  --name vllm-gemma4-26b-fp8-norouter \
  --restart unless-stopped \
  --ipc=host \
  --shm-size 16G \
  --gpus all \
  -v /path/to/models:/models \
  -p 8001:8000 \
  -e NVIDIA_VISIBLE_DEVICES=0 \
  vllm/vllm-openai:gemma4 \
  --model /models/gemma-4-26B-A4B-it-FP8-DYNAMIC-NOROUTER \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.55 \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --host 0.0.0.0 \
  --port 8000
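With the container running, the server exposes an OpenAI-compatible API on host port 8001. A minimal chat-completions payload (the prompt and sampling parameters below are illustrative):

```python
import json

# The "model" field must match the --model path passed to vLLM above.
payload = {
    "model": "/models/gemma-4-26B-A4B-it-FP8-DYNAMIC-NOROUTER",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 256,
    "temperature": 0,
}
print(json.dumps(payload, indent=2))
# POST this body to http://localhost:8001/v1/chat/completions
# with header "Content-Type: application/json".
```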
Known Limitations
- Single-request latency is ~6% higher than BF16 due to FP8 dequantization overhead.
- MMLU-Pro measured (see above); MT-Bench and other benchmarks not yet run. Community contributions are welcome.
- Only tested on NVIDIA H200 NVL. Other GPUs (A100, H100) may require adjusting `gpu-memory-utilization`.
- MoE router weights (`router.proj`) and norm-class 1D tensors are excluded from quantization for vLLM compatibility. No routing degradation has been observed, but systematic evaluation has not been performed.
Intended Use
This artifact is intended for:
- Operational vLLM deployment on H200-class hardware
- Reproducible offline FP8 serving experiments
- Environments where startup-time on-the-fly quantization is undesirable
- Production inference with higher concurrency requirements
This artifact is not intended to replace the original base model documentation, safety guidance, or license terms.
License
This repository contains a derived checkpoint based on google/gemma-4-26B-A4B-it. Usage is subject to the Gemma Terms of Use.
Citation
If you use this artifact, please cite both the derived checkpoint and the upstream base model.
@misc{largitdata_gemma4_26b_a4b_it_fp8_dynamic_norouter_2026,
title = {Gemma 4 26B-A4B IT FP8 Dynamic Norouter},
author = {David Chiu},
year = {2026},
howpublished = {\url{https://huggingface.co/largitdata-inc/gemma-4-26b-a4b-it-fp8-dynamic-norouter}},
note = {Derived offline FP8 checkpoint from google/gemma-4-26B-A4B-it for vLLM serving, published by Largitdata Inc. \url{https://www.largitdata.com/}}
}
@misc{google_gemma4_26b_a4b_it,
title = {Gemma 4 26B-A4B IT},
author = {Google},
year = {2026},
howpublished = {\url{https://huggingface.co/google/gemma-4-26B-A4B-it}}
}
Disclaimer
Users are responsible for verifying license compatibility, downstream serving behavior, numerical quality, and safety characteristics for their own environment.
Chinese Description
Offline FP8 checkpoint for vLLM: 47% less VRAM and 80% more concurrency than BF16.
We searched around online and could not find a usable offline FP8 build of Gemma 4 26B, so we vibe-coded one ourselves and are contributing it to the community.
This repo provides an offline FP8 checkpoint derived from google/gemma-4-26B-A4B-it that vLLM can load and serve directly, with no on-the-fly quantization at startup.
Published by Largitdata Inc.
Note: this is a derived operational checkpoint, not an official Google release. The original model's license terms, safety guidance, and documentation remain authoritative.
Model Details
- Base model: `google/gemma-4-26B-A4B-it`
- Format: offline FP8 checkpoint for vLLM
- Quantization tool: `llmcompressor`
- Quantization method: `FP8_DYNAMIC`
- Calibration data: none required (dynamic quantization)
- Excluded weights:
  - norm-class 1D tensors, to avoid "expected 2D linear weight" validation errors during quantization
  - `re:.*router\.proj$` (MoE router weights), to maintain compatibility with the Gemma4 vLLM loading path
- Primary serving target: `vllm/vllm-openai:gemma4`
Test Environment
- GPU: NVIDIA H200 NVL (143 GB VRAM)
- Runtime: `vllm/vllm-openai:gemma4`
- KV cache dtype: `fp8`
- `max_model_len`: 32768
- `gpu_memory_utilization`: 0.55
Measured startup numbers:
- model weight loading: 15.76 s
- model loading total: 16.88 s
- `torch.compile`: 56.98 s
- engine initialization: 102.17 s
- total time to `/v1/models` ready: about 153 s
Runtime capacity:
- `max_num_batched_tokens`: 8192
- available KV cache memory: 46.37 GiB
- GPU KV cache size: 405,184 tokens
- maximum concurrency (32,768 tokens/request): 38.87x
Serving Capacity Comparison
| Metric | FP8 Dynamic Norouter | BF16 Baseline |
|---|---|---|
| Model loading memory | 25.75 GiB | 48.5 GiB |
| GPU KV cache size | 405,184 tokens | 225,376 tokens |
| Max concurrency @ 32K tokens/req | 38.87x | 21.62x |
| VRAM savings | 47% | — |
| KV cache gain | 80% | — |
Basic Benchmark
Single-request warm benchmark (OpenAI-compatible vLLM endpoint):
- prompt tokens: 38
- completion tokens: 256
- temperature: 0
| Metric | FP8 Dynamic Norouter | BF16 Baseline |
|---|---|---|
| Avg end-to-end latency | 1.629 s | 1.536 s |
| Avg completion throughput | 157.19 tok/s | 166.62 tok/s |
| Avg total throughput | 180.53 tok/s | 191.36 tok/s |
These are single-request warm-path measurements, not a multi-client throughput test. In production multi-client scenarios, the FP8 variant's larger KV cache is expected to deliver better aggregate throughput.
Conclusion: BF16 is slightly faster (~6%) on single requests, but the FP8 variant uses 47% less VRAM and offers 80% more usable KV cache. For production environments serving many concurrent users, the FP8 variant has the advantage.
Accuracy Evaluation — MMLU-Pro
Using the same harness and prompt template, we ran MMLU-Pro (full set, 14 categories) on both the FP8 checkpoint and the BF16 baseline:
| Model | MMLU-Pro Accuracy | Wall clock |
|---|---|---|
| 26B BF16 (baseline) | 81.59% | 65 min |
| 26B FP8 Dynamic Norouter (this release) | 81.33% | 63 min |
| Δ (FP8 − BF16) | −0.26 pp | — |
The cost of FP8 quantization is essentially zero: the 0.26 pp gap falls within benchmark noise. Our 26B BF16 score (81.59%) is about 1.0 pp below Google's official 82.6%, which comes from harness/prompt differences rather than quantization; BF16 and FP8 shift together.
For comparison, 31B results under identical conditions:
| Model | MMLU-Pro |
|---|---|
| 26B BF16 | 81.59% |
| 26B FP8 | 81.33% |
| 31B BF16 | 84.60% |
| 31B FP8 Dynamic | 84.55% |
See also: LargitData/gemma-4-31b-it-fp8.
Usage
Example vLLM launch (replace `/path/to/models` with the host directory that contains the checkpoint):
docker run -d \
  --name vllm-gemma4-26b-fp8-norouter \
  --restart unless-stopped \
  --ipc=host \
  --shm-size 16G \
  --gpus all \
  -v /path/to/models:/models \
  -p 8001:8000 \
  -e NVIDIA_VISIBLE_DEVICES=0 \
  vllm/vllm-openai:gemma4 \
  --model /models/gemma-4-26B-A4B-it-FP8-DYNAMIC-NOROUTER \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.55 \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --host 0.0.0.0 \
  --port 8000
Known Limitations
- Single-request latency is about 6% higher than BF16, mainly due to FP8 dequantization overhead
- MMLU-Pro is done (see above); MT-Bench and other benchmarks have not been run yet, and community contributions are welcome
- Only tested on H200 NVL; other GPUs (e.g. A100, H100) may need an adjusted `gpu-memory-utilization`
- MoE router weights (`router.proj`) and norm-class 1D tensors are excluded from quantization to maintain vLLM compatibility; no routing-quality degradation has been observed, but there is no systematic evaluation yet
Intended Use
This checkpoint is suitable for:
- Production vLLM deployment on H200-class hardware
- Reproducible offline FP8 serving experiments
- Environments that want to avoid on-the-fly quantization at startup
- Production inference scenarios that need higher concurrency
This checkpoint does not replace the original base model's documentation, safety guidance, or license terms.
License
This repo contains a derived checkpoint based on google/gemma-4-26B-A4B-it; usage is subject to the Gemma Terms of Use.
Disclaimer
Users are responsible for verifying license compatibility, downstream serving behavior, numerical quality, and safety characteristics for their own environment.