Gemma 4 31B IT FP8 Dynamic
Production-ready offline FP8 checkpoint for vLLM — 47% less VRAM, 99% more KV cache, and 48% faster than BF16.
We searched for a usable offline FP8 checkpoint of Gemma 4 31B-it but couldn't find one that worked cleanly with vLLM. So we vibe-coded our own and are sharing it with the community.
This repository hosts an offline FP8 checkpoint derived from google/gemma-4-31B-it for vLLM serving. No on-the-fly quantization needed at startup.
Published by Largitdata Inc.
Note: This is a derived operational checkpoint, not an official Google release. The original model's license terms, safety guidance, and documentation remain authoritative.
Model Details
- Base model: `google/gemma-4-31B-it`
- Derived format: offline FP8 checkpoint for vLLM
- Quantization tool: `llmcompressor`
- Quantization method: `FP8_DYNAMIC`
- Calibration data: none required (dynamic quantization)
- Excluded weights:
  - norm-class 1D tensors, excluded to avoid `expected 2D linear weight` validation errors during quantization
  - `re:.*router\.proj$`, router weights excluded to maintain compatibility with the Gemma4 vLLM loading path
- Output directory name: `gemma-4-31B-it-FP8-DYNAMIC`
- Primary serving target: `vllm/vllm-openai:gemma4`
- Organization: Largitdata Inc.
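The two weight-exclusion rules above can be expressed as a small predicate. This is an illustrative sketch only, not the actual llmcompressor recipe; the function and the example module names are hypothetical:

```python
import re

# Weights excluded from FP8_DYNAMIC quantization for this checkpoint:
#  1) norm-class 1D tensors (the quantizer validates for 2D linear weights)
#  2) anything matching re:.*router\.proj$ (Gemma4 vLLM loading path)
ROUTER_PATTERN = re.compile(r".*router\.proj$")

def is_excluded(name: str, ndim: int) -> bool:
    """Return True if a weight is skipped by the recipe (hypothetical helper)."""
    if ndim == 1:  # norm-class 1D tensors (e.g. layernorm weights)
        return True
    return ROUTER_PATTERN.match(name) is not None
```

Everything else (the regular 2D linear weights) is quantized dynamically, which is why no calibration data is needed.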
Test Environment
- GPU: NVIDIA H200 NVL (143 GB VRAM)
- Runtime: `vllm/vllm-openai:gemma4`
- KV cache dtype: `fp8`
- `max_model_len`: 32768
- `gpu_memory_utilization`: 0.65
Observed vLLM startup characteristics:
- Model weight loading: 7.44 s
- Model loading total: 8.96 s
- `torch.compile`: 66.23 s
- Engine init: 108.24 s
- Total time to `/v1/models` ready: about 147 s
Observed runtime capacity:
- `max_num_batched_tokens`: 8192
- Available KV cache memory: 55.22 GiB
- GPU KV cache size: 120,624 tokens
- Maximum concurrency at 32,768 tokens/request: 11.57x
Serving Capacity Comparison
| Metric | FP8 Dynamic | BF16 Baseline |
|---|---|---|
| Model loading memory | 31.49 GiB | 58.9 GiB |
| GPU KV cache size | 120,624 tokens | 60,752 tokens |
| Max concurrency @ 32K tokens/req | 11.57x | 5.83x |
| VRAM savings | 47% less | — |
| KV cache gain | 99% more | — |
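The headline percentages in the table reduce to simple ratios over the measured values; a quick sanity check of the rounding:

```python
def pct_less(baseline: float, new: float) -> float:
    """Percentage reduction relative to the baseline."""
    return (baseline - new) / baseline * 100

def pct_more(baseline: float, new: float) -> float:
    """Percentage increase relative to the baseline."""
    return (new - baseline) / baseline * 100

vram_savings = pct_less(58.9, 31.49)       # ~46.5%, rounds to "47% less VRAM"
kv_cache_gain = pct_more(60_752, 120_624)  # ~98.6%, rounds to "99% more KV cache"
```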
Basic Benchmark
Single-request warm benchmark against the OpenAI-compatible vLLM endpoint:
- Prompt tokens: 38
- Completion tokens: 256
- Temperature: 0
| Metric | FP8 Dynamic | BF16 Baseline |
|---|---|---|
| Avg end-to-end latency | 3.404 s | 5.041 s |
| Avg completion throughput | 75.20 tok/s | 50.79 tok/s |
| Avg total throughput | 86.36 tok/s | 58.32 tok/s |
These numbers are single-request warm-path measurements, not multi-client throughput tests.
Unlike smaller MoE variants where FP8 trades single-request speed for memory savings, the 31B dense FP8 variant is faster across the board — 48% higher completion throughput, 47% less VRAM, and nearly double the KV cache capacity.
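The throughput figures follow directly from the token counts and latencies, so the table is internally consistent; checking with the benchmark parameters above:

```python
PROMPT_TOKENS, COMPLETION_TOKENS = 38, 256

def throughput(tokens: int, latency_s: float) -> float:
    """Tokens per second over a single request."""
    return tokens / latency_s

fp8_completion = throughput(COMPLETION_TOKENS, 3.404)             # ~75.2 tok/s
bf16_completion = throughput(COMPLETION_TOKENS, 5.041)            # ~50.8 tok/s
fp8_total = throughput(PROMPT_TOKENS + COMPLETION_TOKENS, 3.404)  # ~86.4 tok/s
speedup = fp8_completion / bf16_completion - 1                    # ~0.48, i.e. "48% faster"
```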
Concurrent Throughput
| Concurrency | FP8 Aggregate TPS | BF16 Aggregate TPS | FP8 Avg Latency | BF16 Avg Latency |
|---|---|---|---|---|
| 2 | 148.59 tok/s | 102.21 tok/s | 3.444 s | 5.008 s |
| 4 | 292.35 tok/s | 201.25 tok/s | 3.493 s | 5.087 s |
| 8 | 571.22 tok/s | 390.74 tok/s | 3.580 s | 5.233 s |
The FP8 variant maintains its speed advantage under concurrent load, with ~46% higher aggregate throughput at all tested concurrency levels.
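The advantage is stable rather than an artifact of one data point; computing the FP8/BF16 ratio at each tested concurrency level from the table:

```python
# (concurrency, FP8 aggregate tok/s, BF16 aggregate tok/s) from the table above
RESULTS = [
    (2, 148.59, 102.21),
    (4, 292.35, 201.25),
    (8, 571.22, 390.74),
]

# FP8's relative throughput advantage at each concurrency level (~45-46%)
advantages = {c: fp8 / bf16 - 1 for c, fp8, bf16 in RESULTS}
```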
Accuracy Evaluation — MMLU-Pro
We ran MMLU-Pro (full set, 14 categories) comparing this FP8 checkpoint against its BF16 baseline on the same harness and prompt template.
| Model | MMLU-Pro Accuracy | Wall clock |
|---|---|---|
| 31B BF16 (baseline) | 84.60% | 90 min |
| 31B FP8 Dynamic (this) | 84.55% | ~110 min |
| Δ (FP8 − BF16) | −0.05 pp | — |
FP8 quantization cost is essentially zero — a 0.05 pp gap is well within benchmark noise. Our 31B BF16 number (84.60%) is ~0.6 pp below Google's reported 85.2%, consistent with harness/prompt differences rather than a quantization defect; BF16 and FP8 shift together.
For context, the sibling 26B MoE variant under identical conditions:
| Model | MMLU-Pro |
|---|---|
| 26B BF16 | 81.59% |
| 26B FP8 Dynamic Norouter | 81.33% |
| 31B BF16 | 84.60% |
| 31B FP8 Dynamic | 84.55% |
31B beats 26B by +3.00 pp overall, with the biggest gains in 26B's weak categories (law +6.8, history +6.3, philosophy +5.2). See also: LargitData/gemma-4-26b-a4b-it-fp8.
Usage
Example vLLM launch:
docker run -d \
--name vllm-gemma4-31b-fp8 \
--restart unless-stopped \
--ipc=host \
--shm-size 16G \
--gpus all \
  -v /models:/models \
-p 8001:8000 \
-e NVIDIA_VISIBLE_DEVICES=0 \
vllm/vllm-openai:gemma4 \
--model /models/gemma-4-31B-it-FP8-DYNAMIC \
--trust-remote-code \
--kv-cache-dtype fp8 \
--gpu-memory-utilization 0.65 \
--max-model-len 32768 \
--enable-auto-tool-choice \
--tool-call-parser gemma4 \
--host 0.0.0.0 \
--port 8000
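A minimal client sketch against the container above. The payload mirrors the benchmark settings; the host port (8001) follows the `-p 8001:8000` mapping, and the endpoint path is the standard OpenAI-compatible route. The actual HTTP call is left as a comment since it needs a running server:

```python
import json

def chat_payload(prompt: str, max_tokens: int = 256, temperature: float = 0.0) -> dict:
    """Request body for /v1/chat/completions, matching the benchmark settings above."""
    return {
        "model": "/models/gemma-4-31B-it-FP8-DYNAMIC",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

body = json.dumps(chat_payload("Summarize FP8 dynamic quantization in one sentence."))
# POST to http://localhost:8001/v1/chat/completions with
# Content-Type: application/json, e.g.:
#   curl -s http://localhost:8001/v1/chat/completions \
#     -H 'Content-Type: application/json' -d "$body"
```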
Known Limitations
- MMLU-Pro measured (see above); MT-Bench and other benchmarks not yet run. Community contributions are welcome.
- Only tested on an NVIDIA H200 NVL; other GPUs (e.g. A100, H100) may require adjusting `gpu-memory-utilization`.
- norm-class 1D tensors and `router.proj` weights are excluded from quantization for vLLM compatibility.
Intended Use
This artifact is intended for:
- Operational vLLM deployment on H200-class hardware
- Reproducible offline FP8 serving experiments
- Environments where startup-time on-the-fly quantization is undesirable
- Production inference with higher concurrency requirements
This artifact is not intended to replace the original base model documentation, safety guidance, or license terms.
License
This repository contains a derived checkpoint based on google/gemma-4-31B-it. Usage is subject to the Gemma Terms of Use.
Citation
If you use this artifact, please cite both the derived checkpoint and the upstream base model.
@misc{largitdata_gemma4_31b_it_fp8_dynamic_2026,
title = {Gemma 4 31B IT FP8 Dynamic},
author = {David Chiu},
year = {2026},
howpublished = {\url{https://huggingface.co/largitdata-inc/gemma-4-31b-it-fp8-dynamic}},
note = {Derived offline FP8 checkpoint from google/gemma-4-31B-it for vLLM serving, published by Largitdata Inc. \url{https://www.largitdata.com/}}
}
@misc{google_gemma4_31b_it,
title = {Gemma 4 31B IT},
author = {Google},
year = {2026},
howpublished = {\url{https://huggingface.co/google/gemma-4-31B-it}}
}
Disclaimer
Users are responsible for verifying license compatibility, downstream serving behavior, numerical quality, and safety characteristics for their own environment.