Gemma 4 31B IT FP8 Dynamic

Production-ready offline FP8 checkpoint for vLLM — 47% less VRAM, 99% more KV cache, and 48% faster than BF16.

We searched for a usable offline FP8 checkpoint of Gemma 4 31B-it but couldn't find one that worked cleanly with vLLM. So we vibe-coded our own and are sharing it with the community.

This repository hosts an offline FP8 checkpoint derived from google/gemma-4-31B-it for vLLM serving. No on-the-fly quantization needed at startup.

Published by Largitdata Inc.

Note: This is a derived operational checkpoint, not an official Google release. The original model's license terms, safety guidance, and documentation remain authoritative.

Model Details

  • Base model: google/gemma-4-31B-it
  • Derived format: offline FP8 checkpoint for vLLM
  • Quantization tool: llmcompressor
  • Quantization method: FP8_DYNAMIC
  • Calibration data: None required (dynamic quantization)
  • Excluded weights:
  • norm-class 1D tensors: excluded to avoid "expected 2D linear weight" validation errors during quantization
  • re:.*router\.proj$: router weights excluded to keep the Gemma4 vLLM loading path compatible
  • Output directory name: gemma-4-31B-it-FP8-DYNAMIC
  • Primary serving target: vllm/vllm-openai:gemma4
  • Organization: Largitdata Inc.
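
The exclusion rules above can be sketched as a small predicate. This mirrors the llmcompressor-style ignore convention (plain entries match a module name exactly; `re:`-prefixed entries are regexes), but it is an illustrative stand-in, not the library's implementation, and the module names in the examples are hypothetical:

```python
import re

# Ignore list used for this checkpoint (see Model Details): router
# projections are skipped via a regex; 1D norm tensors are skipped because
# FP8_DYNAMIC targets 2D Linear weights only.
IGNORE = [r"re:.*router\.proj$"]

def is_quantized(module_name: str, weight_ndim: int) -> bool:
    """Return True if a weight would be FP8-quantized under this recipe."""
    if weight_ndim != 2:  # norm-class 1D tensors fail the 2D linear check
        return False
    for entry in IGNORE:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):
                return False
        elif entry == module_name:
            return False
    return True

# Hypothetical module names, for illustration only:
print(is_quantized("model.layers.0.self_attn.q_proj", 2))  # quantized
print(is_quantized("model.layers.0.mlp.router.proj", 2))   # excluded by regex
print(is_quantized("model.layers.0.input_layernorm", 1))   # excluded as 1D
```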

Test Environment

  • GPU: NVIDIA H200 NVL (143 GB VRAM)
  • Runtime: vllm/vllm-openai:gemma4
  • KV cache dtype: fp8
  • max_model_len: 32768
  • gpu_memory_utilization: 0.65

Observed vLLM startup characteristics:

  • model weight loading: 7.44 s
  • model loading total: 8.96 s
  • torch.compile: 66.23 s
  • engine init: 108.24 s
  • total time to /v1/models ready: about 147 s

Observed runtime capacity:

  • max_num_batched_tokens = 8192
  • available KV cache memory: 55.22 GiB
  • GPU KV cache size: 120,624 tokens
  • maximum concurrency at 32,768 tokens/request: 11.57x
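
As a rough sanity check, these capacity figures are consistent with the memory budget implied by gpu_memory_utilization. This sketch ignores GB-vs-GiB rounding; the remainder covers activations, CUDA graphs, and allocator overhead:

```python
total_vram = 143.0                 # H200 NVL, as reported above
budget = total_vram * 0.65         # gpu_memory_utilization = 0.65
weights = 31.49                    # FP8 model loading memory, GiB
kv_cache = 55.22                   # available KV cache memory, GiB
overhead = budget - weights - kv_cache   # headroom for everything else

# Implied fp8 KV cache footprint per cached token:
bytes_per_token = kv_cache * 2**30 / 120_624
print(f"{overhead:.2f} GiB headroom, {bytes_per_token / 1024:.0f} KiB/token")
```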

Serving Capacity Comparison

| Metric | FP8 Dynamic | BF16 Baseline |
|---|---|---|
| Model loading memory | 31.49 GiB | 58.9 GiB |
| GPU KV cache size | 120,624 tokens | 60,752 tokens |
| Max concurrency @ 32K tokens/req | 11.57x | 5.83x |
| VRAM savings vs BF16 | 47% less | |
| KV cache gain vs BF16 | 99% more | |
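
The derived percentages follow directly from the measured numbers; a quick check:

```python
vram_savings = 1 - 31.49 / 58.9        # FP8 vs BF16 model loading memory
kv_gain = 120_624 / 60_752 - 1         # GPU KV cache tokens, FP8 vs BF16
conc_ratio = 11.57 / 5.83              # max concurrency ratio at 32K/req

print(f"VRAM: {vram_savings:.0%} less, KV cache: {kv_gain:.0%} more, "
      f"concurrency: {conc_ratio:.2f}x")
```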

Basic Benchmark

Single-request warm benchmark against the OpenAI-compatible vLLM endpoint:

  • prompt tokens: 38
  • completion tokens: 256
  • temperature: 0

| Metric | FP8 Dynamic | BF16 Baseline |
|---|---|---|
| Avg end-to-end latency | 3.404 s | 5.041 s |
| Avg completion throughput | 75.20 tok/s | 50.79 tok/s |
| Avg total throughput | 86.36 tok/s | 58.32 tok/s |

These numbers are single-request warm-path measurements, not multi-client throughput tests.

Unlike smaller MoE variants where FP8 trades single-request speed for memory savings, the 31B dense FP8 variant is faster across the board — 48% higher completion throughput, 47% less VRAM, and nearly double the KV cache capacity.
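
The throughput figures are direct functions of latency and the fixed token counts (completion tokens / latency, and prompt + completion tokens / latency), so they are easy to re-derive:

```python
prompt, completion = 38, 256   # benchmark token counts from above

def throughputs(latency_s: float) -> tuple[float, float]:
    """(completion tok/s, total tok/s) for one warm single request."""
    return completion / latency_s, (prompt + completion) / latency_s

fp8 = throughputs(3.404)       # ~ (75.2, 86.4) tok/s
bf16 = throughputs(5.041)      # ~ (50.8, 58.3) tok/s
speedup = fp8[0] / bf16[0]     # ~1.48, i.e. the 48% completion-throughput gain
```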

Concurrent Throughput

| Concurrency | FP8 Aggregate TPS | BF16 Aggregate TPS | FP8 Avg Latency | BF16 Avg Latency |
|---|---|---|---|---|
| 2 | 148.59 tok/s | 102.21 tok/s | 3.444 s | 5.008 s |
| 4 | 292.35 tok/s | 201.25 tok/s | 3.493 s | 5.087 s |
| 8 | 571.22 tok/s | 390.74 tok/s | 3.580 s | 5.233 s |

The FP8 variant maintains its speed advantage under concurrent load, with ~46% higher aggregate throughput at all tested concurrency levels.
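
The ~46% figure holds at every tested level:

```python
# (FP8 aggregate TPS, BF16 aggregate TPS) per concurrency level, from above.
pairs = {2: (148.59, 102.21), 4: (292.35, 201.25), 8: (571.22, 390.74)}

for conc, (fp8_tps, bf16_tps) in pairs.items():
    gain = fp8_tps / bf16_tps - 1
    print(f"concurrency {conc}: +{gain:.1%}")   # each lands between 45% and 47%
```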

Accuracy Evaluation — MMLU-Pro

We ran MMLU-Pro (full set, 14 categories) comparing this FP8 checkpoint against its BF16 baseline on the same harness and prompt template.

| Model | MMLU-Pro Accuracy | Wall clock |
|---|---|---|
| 31B BF16 (baseline) | 84.60% | 90 min |
| 31B FP8 Dynamic (this) | 84.55% | ~110 min |
| Δ (FP8 − BF16) | −0.05 pp | |

FP8 quantization cost is essentially zero — a 0.05 pp gap is well within benchmark noise. Our 31B BF16 number (84.60%) is ~0.6 pp below Google's reported 85.2%, consistent with harness/prompt differences rather than a quantization defect; BF16 and FP8 shift together.
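
"Within benchmark noise" can be made concrete with a standard-error estimate; the test-set size below is an assumption (MMLU-Pro's full set is on the order of 12,000 questions), not a figure from this report:

```python
import math

n = 12_000          # approximate MMLU-Pro test-set size (assumption)
p = 0.8455          # observed accuracy
se = math.sqrt(p * (1 - p) / n)   # standard error of a binomial proportion

# se comes out around 0.33 pp, so the 0.05 pp FP8-vs-BF16 gap sits far
# inside a single standard error.
print(f"standard error: {se * 100:.2f} pp")
```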

For context, the sibling 26B MoE variant under identical conditions:

| Model | MMLU-Pro |
|---|---|
| 26B BF16 | 81.59% |
| 26B FP8 Dynamic Norouter | 81.33% |
| 31B BF16 | 84.60% |
| 31B FP8 Dynamic | 84.55% |

31B beats 26B by +3.01 pp overall, with the biggest gains in 26B's weak categories (law +6.8, history +6.3, philosophy +5.2). See also: LargitData/gemma-4-26b-a4b-it-fp8.

Usage

Example vLLM launch:

docker run -d \
  --name vllm-gemma4-31b-fp8 \
  --restart unless-stopped \
  --ipc=host \
  --shm-size 16G \
  --gpus all \
  -v /models:/models \
  -p 8001:8000 \
  -e NVIDIA_VISIBLE_DEVICES=0 \
  vllm/vllm-openai:gemma4 \
  --model /models/gemma-4-31B-it-FP8-DYNAMIC \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.65 \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --host 0.0.0.0 \
  --port 8000
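
Once the container reports ready, the endpoint speaks the OpenAI chat-completions protocol on host port 8001 (as mapped above), and the served model name must match the --model path. A minimal stdlib client sketch:

```python
import json
from urllib import request

def build_chat_request(base_url: str, model: str, prompt: str) -> request.Request:
    """Assemble a /v1/chat/completions request, inspectable before sending."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": 256,
    }
    return request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request(
    "http://localhost:8001",
    "/models/gemma-4-31B-it-FP8-DYNAMIC",
    "Say hello.",
)
# With the server running, send it:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```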

Known Limitations

  • MMLU-Pro measured (see above); MT-Bench and other benchmarks not yet run. Community contributions are welcome.
  • Only tested on NVIDIA H200 NVL. Other GPUs (A100, H100) may require adjusting gpu-memory-utilization.
  • norm-class 1D tensors and router.proj weights are excluded from quantization for vLLM compatibility.

Intended Use

This artifact is intended for:

  • Operational vLLM deployment on H200-class hardware
  • Reproducible offline FP8 serving experiments
  • Environments where startup-time on-the-fly quantization is undesirable
  • Production inference with higher concurrency requirements

This artifact is not intended to replace the original base model documentation, safety guidance, or license terms.

License

This repository contains a derived checkpoint based on google/gemma-4-31B-it. Usage is subject to the Gemma Terms of Use.

Citation

If you use this artifact, please cite both the derived checkpoint and the upstream base model.

@misc{largitdata_gemma4_31b_it_fp8_dynamic_2026,
  title        = {Gemma 4 31B IT FP8 Dynamic},
  author       = {David Chiu},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/largitdata-inc/gemma-4-31b-it-fp8-dynamic}},
  note         = {Derived offline FP8 checkpoint from google/gemma-4-31B-it for vLLM serving, published by Largitdata Inc. \url{https://www.largitdata.com/}}
}

@misc{google_gemma4_31b_it,
  title        = {Gemma 4 31B IT},
  author       = {Google},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/google/gemma-4-31B-it}}
}

Disclaimer

Users are responsible for verifying license compatibility, downstream serving behavior, numerical quality, and safety characteristics for their own environment.


