# Gemma 4 E4B-it NVFP4A16
By coolthor — writing about LLM infrastructure, quantization, and local AI deployment on DGX Spark.
First NVFP4 quantization of Google's Gemma 4 E4B-it model: NVFP4A16 (4-bit FP weights, 16-bit activations), produced with llm-compressor from the official BF16 checkpoint `google/gemma-4-E4B-it`.
## Benchmark — NVIDIA DGX Spark (GB10)
Tested on DGX Spark (GB10, 128GB unified memory, 273 GB/s bandwidth):
| Format | GPU memory | tok/s | Speedup vs BF16 |
|---|---|---|---|
| BF16 | 15.0 GB | 19.2 | 1.0x |
| FP8 (online) | 11.4 GB | 36.0 | 1.9x |
| NVFP4A16 | 9.8 GB | 49.9 | 2.6x |
| NVFP4 W4A4 | 10.2 GB | 39.2 | 2.0x |
- Each figure is the mean of 3 runs × 500 tokens; run-to-run variance ±0.1 tok/s
- Long output (1000 tokens): 49.8 tok/s, no degradation
- Concurrent (3 parallel): 52.7 tok/s per request, 158 tok/s aggregate
- No repetition degradation observed
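The single-request figures above can be reproduced with a simple client-side probe against a running vLLM server. A minimal sketch — the `base_url` and served model name in the trailing comments are assumptions; adjust them to your deployment:

```python
# Client-side decode-throughput probe for an OpenAI-compatible
# vLLM server.
import time

def tok_per_s(completion_tokens: int, elapsed_s: float) -> float:
    """Tokens generated divided by wall-clock seconds."""
    return completion_tokens / elapsed_s

def measure(client, model: str, prompt: str, max_tokens: int = 500) -> float:
    """Time one chat completion and return its decode throughput."""
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return tok_per_s(resp.usage.completion_tokens, time.perf_counter() - start)

# With a server running (see the usage sections below):
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
#   runs = [measure(client, "gemma-4-e4b", "Tell me a story.") for _ in range(3)]
#   print(f"{sum(runs) / len(runs):.1f} tok/s")   # mean of 3 runs, as above
```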
### Why W4A16 beats W4A4 on GB10
GB10 (SM121) lacks native FP4 activation compute. W4A4 falls back to Marlin dequantization for activations, adding overhead that negates the bandwidth savings. On GPUs with native FP4 compute (B200/B100), W4A4 should be faster. On GB10, W4A16 is the optimal format.
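A back-of-envelope roofline makes the bandwidth argument concrete. This is a rough sketch that treats decode as purely weight-bandwidth-bound, ignoring PLE lookups, KV-cache traffic, and the Triton attention overhead covered in the next section:

```python
# Memory-bound decode ceiling: tok/s <= bandwidth / bytes_per_token.
GB = 1e9
BANDWIDTH = 273 * GB       # GB10 unified-memory bandwidth
COMPUTE_PARAMS = 4e9       # E4B effective parameters read per token

def decode_ceiling(bytes_per_weight: float) -> float:
    """Upper bound on tok/s if decode only streamed the weights."""
    return BANDWIDTH / (COMPUTE_PARAMS * bytes_per_weight)

print(f"BF16  (2.0 B/param): {decode_ceiling(2.0):6.1f} tok/s ceiling")
print(f"NVFP4 (0.5 B/param): {decode_ceiling(0.5):6.1f} tok/s ceiling")
```

Measured throughput sits well below both ceilings because of the PLE tables and the attention fallback discussed in this card, but the 4x gap between the two ceilings is what quantization is buying.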
## Known Performance Limitation: Triton Attention Fallback
This affects ALL E4B configurations on vLLM (BF16, FP8, NVFP4), not just this checkpoint.
Gemma 4 E4B has heterogeneous attention head dimensions — sliding attention uses head_dim=256 while global attention uses head_dim=512. This forces vLLM to disable FlashAttention and fall back to the significantly slower Triton attention kernel (vLLM issue #38887).
Impact: On RTX 4090, E4B runs at ~9 tok/s vs ~100+ tok/s for a similar-sized Llama 3B — an 11x gap attributable to the attention backend. On GB10, the impact is less extreme but still significant.
All benchmark numbers above are measured WITH this Triton fallback. Once PR #38891 (per-layer attention backend selection) is merged, expect a substantial speed improvement across all formats — potentially 30-50% faster.
Track the fix:
- Issue: vllm-project/vllm#38887
- PR: vllm-project/vllm#38891
## About the Model
Gemma 4 E4B uses Per-Layer Embedding (PLE) architecture — not MoE, not traditional dense:
- Effective parameters: 4B (compute per token)
- Total parameters: 8B (including PLE embedding tables)
- PLE embedding tables (~5.4 GB in BF16) are large but only used for fast lookups — not quantized
- 42 decoder layers (36 sliding attention + 6 global attention)
- Supports: text, vision, audio, 128K context, tool calling
### PLE Bandwidth Impact
The PLE tables account for ~36% of the model's memory footprint and consume a disproportionate share of bandwidth: they are accessed on every token and remain in BF16. This is why E4B (50 tok/s) is slightly slower than the 26B-A4B MoE (52 tok/s) despite being a much smaller model: the MoE reads only its 3.8B active parameters per token, while E4B reads its 4B compute weights plus lookups into the 5.4 GB of BF16 PLE tables.
## Quantization Details
- Method: NVFP4A16 (weight-only, 4-bit FP, 16-bit activations)
- Tool: llm-compressor v0.10.1.dev (git main, post-PR #2561)
- Data-free: No calibration dataset required
- Ignored layers: `lm_head`, `vision_tower`, `audio_tower`, `embed_vision`, `embed_audio`
### Recipe

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForCausalLM

# Load the official BF16 checkpoint (the exact Auto class for this
# multimodal model may differ; adjust if transformers routes it elsewhere).
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-E4B-it", torch_dtype="auto"
)

# Data-free, weight-only NVFP4: lm_head and the multimodal
# towers/embeddings stay unquantized.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4A16",
    ignore=[
        "lm_head",
        "re:.*vision_tower.*",
        "re:.*audio_tower.*",
        "re:.*embed_vision.*",
        "re:.*embed_audio.*",
    ],
)

oneshot(model=model, recipe=recipe)
```
## Installation Note (as of April 2026)

llm-compressor on PyPI (v0.10.0.1) does not support Gemma 4 due to a transformers version conflict. You must install from git main:

```shell
pip install 'git+https://github.com/vllm-project/llm-compressor.git@main'
pip install --force-reinstall --no-deps 'transformers>=5.5' 'huggingface_hub>=0.30'
pip install torchvision  # required for Gemma4VideoProcessor
```
## Usage with vLLM

```shell
vllm serve coolthor/Gemma-4-E4B-it-NVFP4A16 \
  --quantization compressed-tensors \
  --kv-cache-dtype fp8 \
  --max-model-len 16384
```
Or with Docker:
```shell
docker run -d \
  --gpus all --ipc host --shm-size 32gb \
  -p 8000:8000 \
  -v /path/to/model:/models/gemma4-e4b \
  vllm/vllm-openai:gemma4-cu130 \
  --model /models/gemma4-e4b \
  --served-model-name gemma-4-e4b \
  --quantization compressed-tensors \
  --kv-cache-dtype fp8 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.75 \
  --reasoning-parser gemma4 \
  --enable-auto-tool-choice --tool-call-parser pythonic
```
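Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch: the model name matches the `--served-model-name` from the Docker launch above (for the bare `vllm serve` launch, use the repo id instead), and the endpoint/port are assumptions:

```python
# Build a chat request for the server above. Kept as a pure payload
# builder so it can be inspected without a running server.
def chat_payload(model: str, prompt: str, max_tokens: int = 128) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = chat_payload("gemma-4-e4b", "Summarize NVFP4A16 in one sentence.")

# With the openai package installed and the server running:
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
#   resp = client.chat.completions.create(**payload)
#   print(resp.choices[0].message.content)
```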
### Quick Alternative: FP8 Online (no download needed)

If you just want a speed boost without downloading a quantized checkpoint:

```shell
vllm serve google/gemma-4-E4B-it \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 16384
```
This gives 36 tok/s (1.9x over BF16) with zero extra work.
## Quality Checks
| Test | Result |
|---|---|
| Chinese (Traditional) | Fluent, correct BPS explanation |
| Math/Reasoning | Put-spread max profit/loss calculated correctly |
| Code Generation | Black-Scholes implementation complete and correct |
| Structured Output (JSON) | Valid JSON format |
| Repetition Test (count 1-50) | No duplicates, no degradation |
## Hardware Tested
- NVIDIA DGX Spark (GB10, SM121, 128GB unified memory, 273 GB/s)
- vLLM v0.18.2rc1 (gemma4-cu130 image)
## Links
- Base model: google/gemma-4-E4B-it
- Quantization recipe: llm-compressor PR #2561
- Triton fallback issue: vllm-project/vllm#38887
- Full benchmark article: ai-muninn.com