Gemma 4 31B IT FP8 Dynamic

Production-ready offline FP8 checkpoint for vLLM — 47% less VRAM, 99% more KV cache, and 48% faster than BF16.

We searched for a usable offline FP8 checkpoint of Gemma 4 31B-it but couldn't find one that worked cleanly with vLLM. So we vibe-coded our own and are sharing it with the community.

This repository hosts an offline FP8 checkpoint derived from google/gemma-4-31B-it for vLLM serving. No on-the-fly quantization needed at startup.

Published by Largitdata Inc.

Note: This is a derived operational checkpoint, not an official Google release. The original model's license terms, safety guidance, and documentation remain authoritative.

Model Details

  • Base model: google/gemma-4-31B-it
  • Derived format: offline FP8 checkpoint for vLLM
  • Quantization tool: llmcompressor
  • Quantization method: FP8_DYNAMIC
  • Calibration data: None required (dynamic quantization)
  • Excluded weights:
  • norm-class 1D tensors: excluded to avoid "expected 2D linear weight" validation errors during quantization
  • re:.*router\.proj$: router weights excluded to keep the Gemma4 vLLM loading path compatible
  • Output directory name: gemma-4-31B-it-FP8-DYNAMIC
  • Primary serving target: vllm/vllm-openai:gemma4
  • Organization: Largitdata Inc.
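
The exclusion rules above can be sketched as a small predicate. This mirrors the llmcompressor-style ignore convention (plain entries match a module name exactly; `re:`-prefixed entries are regexes), but it is an illustrative stand-in, not the library's implementation, and the module names in the examples are hypothetical:

```python
import re

# Ignore list used for this checkpoint (see Model Details): router
# projections are skipped via a regex; 1D norm tensors are skipped because
# FP8_DYNAMIC targets 2D Linear weights only.
IGNORE = [r"re:.*router\.proj$"]

def is_quantized(module_name: str, weight_ndim: int) -> bool:
    """Return True if a weight would be FP8-quantized under this recipe."""
    if weight_ndim != 2:  # norm-class 1D tensors fail the 2D linear check
        return False
    for entry in IGNORE:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):
                return False
        elif entry == module_name:
            return False
    return True

# Hypothetical module names, for illustration only:
print(is_quantized("model.layers.0.self_attn.q_proj", 2))  # quantized
print(is_quantized("model.layers.0.mlp.router.proj", 2))   # excluded by regex
print(is_quantized("model.layers.0.input_layernorm", 1))   # excluded as 1D
```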

Test Environment

  • GPU: NVIDIA H200 NVL (143 GB VRAM)
  • Runtime: vllm/vllm-openai:gemma4
  • KV cache dtype: fp8
  • max_model_len: 32768
  • gpu_memory_utilization: 0.65

Observed vLLM startup characteristics:

  • model weight loading: 7.44 s
  • model loading total: 8.96 s
  • torch.compile: 66.23 s
  • engine init: 108.24 s
  • total time to /v1/models ready: about 147 s

Observed runtime capacity:

  • max_num_batched_tokens = 8192
  • available KV cache memory: 55.22 GiB
  • GPU KV cache size: 120,624 tokens
  • maximum concurrency at 32,768 tokens/request: 11.57x
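
As a rough sanity check, these capacity figures are consistent with the memory budget implied by gpu_memory_utilization. This sketch ignores GB-vs-GiB rounding; the remainder covers activations, CUDA graphs, and allocator overhead:

```python
total_vram = 143.0                 # H200 NVL, as reported above
budget = total_vram * 0.65         # gpu_memory_utilization = 0.65
weights = 31.49                    # FP8 model loading memory, GiB
kv_cache = 55.22                   # available KV cache memory, GiB
overhead = budget - weights - kv_cache   # headroom for everything else

# Implied fp8 KV cache footprint per cached token:
bytes_per_token = kv_cache * 2**30 / 120_624
print(f"{overhead:.2f} GiB headroom, {bytes_per_token / 1024:.0f} KiB/token")
```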

Serving Capacity Comparison

| Metric | FP8 Dynamic | BF16 Baseline |
|---|---|---|
| Model loading memory | 31.49 GiB | 58.9 GiB |
| GPU KV cache size | 120,624 tokens | 60,752 tokens |
| Max concurrency @ 32K tokens/req | 11.57x | 5.83x |
| VRAM savings vs BF16 | 47% less | |
| KV cache gain vs BF16 | 99% more | |
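
The derived percentages follow directly from the measured numbers; a quick check:

```python
vram_savings = 1 - 31.49 / 58.9        # FP8 vs BF16 model loading memory
kv_gain = 120_624 / 60_752 - 1         # GPU KV cache tokens, FP8 vs BF16
conc_ratio = 11.57 / 5.83              # max concurrency ratio at 32K/req

print(f"VRAM: {vram_savings:.0%} less, KV cache: {kv_gain:.0%} more, "
      f"concurrency: {conc_ratio:.2f}x")
```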

Basic Benchmark

Single-request warm benchmark against the OpenAI-compatible vLLM endpoint:

  • prompt tokens: 38
  • completion tokens: 256
  • temperature: 0

| Metric | FP8 Dynamic | BF16 Baseline |
|---|---|---|
| Avg end-to-end latency | 3.404 s | 5.041 s |
| Avg completion throughput | 75.20 tok/s | 50.79 tok/s |
| Avg total throughput | 86.36 tok/s | 58.32 tok/s |

These numbers are single-request warm-path measurements, not multi-client throughput tests.

Unlike smaller MoE variants where FP8 trades single-request speed for memory savings, the 31B dense FP8 variant is faster across the board — 48% higher completion throughput, 47% less VRAM, and nearly double the KV cache capacity.
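
The throughput figures are direct functions of latency and the fixed token counts (completion tokens / latency, and prompt + completion tokens / latency), so they are easy to re-derive:

```python
prompt, completion = 38, 256   # benchmark token counts from above

def throughputs(latency_s: float) -> tuple[float, float]:
    """(completion tok/s, total tok/s) for one warm single request."""
    return completion / latency_s, (prompt + completion) / latency_s

fp8 = throughputs(3.404)       # ~ (75.2, 86.4) tok/s
bf16 = throughputs(5.041)      # ~ (50.8, 58.3) tok/s
speedup = fp8[0] / bf16[0]     # ~1.48, i.e. the 48% completion-throughput gain
```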

Concurrent Throughput

| Concurrency | FP8 Aggregate TPS | BF16 Aggregate TPS | FP8 Avg Latency | BF16 Avg Latency |
|---|---|---|---|---|
| 2 | 148.59 tok/s | 102.21 tok/s | 3.444 s | 5.008 s |
| 4 | 292.35 tok/s | 201.25 tok/s | 3.493 s | 5.087 s |
| 8 | 571.22 tok/s | 390.74 tok/s | 3.580 s | 5.233 s |

The FP8 variant maintains its speed advantage under concurrent load, with ~46% higher aggregate throughput at all tested concurrency levels.
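
The ~46% figure holds at every tested level:

```python
# (FP8 aggregate TPS, BF16 aggregate TPS) per concurrency level, from above.
pairs = {2: (148.59, 102.21), 4: (292.35, 201.25), 8: (571.22, 390.74)}

for conc, (fp8_tps, bf16_tps) in pairs.items():
    gain = fp8_tps / bf16_tps - 1
    print(f"concurrency {conc}: +{gain:.1%}")   # each lands between 45% and 47%
```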

Accuracy Evaluation — MMLU-Pro

We ran MMLU-Pro (full set, 14 categories) comparing this FP8 checkpoint against its BF16 baseline on the same harness and prompt template.

| Model | MMLU-Pro Accuracy | Wall clock |
|---|---|---|
| 31B BF16 (baseline) | 84.60% | 90 min |
| 31B FP8 Dynamic (this) | 84.55% | ~110 min |
| Δ (FP8 − BF16) | −0.05 pp | |

FP8 quantization cost is essentially zero — a 0.05 pp gap is well within benchmark noise. Our 31B BF16 number (84.60%) is ~0.6 pp below Google's reported 85.2%, consistent with harness/prompt differences rather than a quantization defect; BF16 and FP8 shift together.
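
"Within benchmark noise" can be made concrete with a standard-error estimate; the test-set size below is an assumption (MMLU-Pro's full set is on the order of 12,000 questions), not a figure from this report:

```python
import math

n = 12_000          # approximate MMLU-Pro test-set size (assumption)
p = 0.8455          # observed accuracy
se = math.sqrt(p * (1 - p) / n)   # standard error of a binomial proportion

# se comes out around 0.33 pp, so the 0.05 pp FP8-vs-BF16 gap sits far
# inside a single standard error.
print(f"standard error: {se * 100:.2f} pp")
```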

For context, the sibling 26B MoE variant under identical conditions:

| Model | MMLU-Pro |
|---|---|
| 26B BF16 | 81.59% |
| 26B FP8 Dynamic Norouter | 81.33% |
| 31B BF16 | 84.60% |
| 31B FP8 Dynamic | 84.55% |

31B beats 26B by +3.01 pp overall, with the biggest gains in 26B's weak categories (law +6.8, history +6.3, philosophy +5.2). See also: LargitData/gemma-4-26b-a4b-it-fp8.

Usage

Example vLLM launch:

docker run -d \
  --name vllm-gemma4-31b-fp8 \
  --restart unless-stopped \
  --ipc=host \
  --shm-size 16G \
  --gpus all \
  -v /models:/models \
  -p 8001:8000 \
  -e NVIDIA_VISIBLE_DEVICES=0 \
  vllm/vllm-openai:gemma4 \
  --model /models/gemma-4-31B-it-FP8-DYNAMIC \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.65 \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --host 0.0.0.0 \
  --port 8000
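
Once the container reports ready, the endpoint speaks the OpenAI chat-completions protocol on host port 8001 (as mapped above), and the served model name must match the --model path. A minimal stdlib client sketch:

```python
import json
from urllib import request

def build_chat_request(base_url: str, model: str, prompt: str) -> request.Request:
    """Assemble a /v1/chat/completions request, inspectable before sending."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": 256,
    }
    return request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request(
    "http://localhost:8001",
    "/models/gemma-4-31B-it-FP8-DYNAMIC",
    "Say hello.",
)
# With the server running, send it:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```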

Known Limitations

  • MMLU-Pro measured (see above); MT-Bench and other benchmarks not yet run. Community contributions are welcome.
  • Only tested on NVIDIA H200 NVL. Other GPUs (A100, H100) may require adjusting gpu-memory-utilization.
  • norm-class 1D tensors and router.proj weights are excluded from quantization for vLLM compatibility.

Intended Use

This artifact is intended for:

  • Operational vLLM deployment on H200-class hardware
  • Reproducible offline FP8 serving experiments
  • Environments where startup-time on-the-fly quantization is undesirable
  • Production inference with higher concurrency requirements

This artifact is not intended to replace the original base model documentation, safety guidance, or license terms.

License

This repository contains a derived checkpoint based on google/gemma-4-31B-it. Usage is subject to the Gemma Terms of Use.

Citation

If you use this artifact, please cite both the derived checkpoint and the upstream base model.

@misc{largitdata_gemma4_31b_it_fp8_dynamic_2026,
  title        = {Gemma 4 31B IT FP8 Dynamic},
  author       = {David Chiu},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/largitdata-inc/gemma-4-31b-it-fp8-dynamic}},
  note         = {Derived offline FP8 checkpoint from google/gemma-4-31B-it for vLLM serving, published by Largitdata Inc. \url{https://www.largitdata.com/}}
}

@misc{google_gemma4_31b_it,
  title        = {Gemma 4 31B IT},
  author       = {Google},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/google/gemma-4-31B-it}}
}

Disclaimer

Users are responsible for verifying license compatibility, downstream serving behavior, numerical quality, and safety characteristics for their own environment.


