Gemma 4 E4B-it NVFP4A16

By coolthor — writing about LLM infrastructure, quantization, and local AI deployment on DGX Spark.

First NVFP4 quantization of Google's Gemma 4 E4B-it model.

NVFP4A16 (4-bit weights, 16-bit activations) quantized using llm-compressor from the official BF16 checkpoint google/gemma-4-E4B-it.

Benchmark — NVIDIA DGX Spark (GB10)

Tested on DGX Spark (GB10, 128GB unified memory, 273 GB/s bandwidth):

| Format | GPU Memory | tok/s | Relative |
|---|---|---|---|
| BF16 | 15.0 GB | 19.2 | 1.0x |
| FP8 (online) | 11.4 GB | 36.0 | 1.9x |
| NVFP4A16 | 9.8 GB | 49.9 | 2.6x |
| NVFP4 W4A4 | 10.2 GB | 39.2 | 2.0x |
  • 3 runs x 500 tokens, +/-0.1 tok/s variance
  • Long output (1000 tokens): 49.8 tok/s, no degradation
  • Concurrent (3 parallel): 52.7 tok/s per request, 158 tok/s aggregate
  • No repetition degradation observed
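
A measurement along the lines of the numbers above can be sketched with only the standard library, against a locally running vLLM server (the endpoint URL, prompt, and model name here are illustrative assumptions, not the exact harness used):

```python
# Rough single-request throughput probe against an OpenAI-compatible vLLM
# server. Assumes the server from the "Usage with vLLM" section is running
# locally; URL, prompt, and model name are illustrative.
import json
import time
import urllib.request

def tok_per_sec(completion_tokens: int, elapsed_s: float) -> float:
    """Decode throughput: generated tokens divided by wall-clock seconds."""
    return completion_tokens / elapsed_s

def measure(base_url: str = "http://localhost:8000/v1",
            max_tokens: int = 500) -> float:
    payload = {
        "model": "coolthor/Gemma-4-E4B-it-NVFP4A16",
        "prompt": "Explain quantization in one paragraph.",
        "max_tokens": max_tokens,
    }
    req = urllib.request.Request(
        f"{base_url}/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    elapsed = time.perf_counter() - start
    return tok_per_sec(body["usage"]["completion_tokens"], elapsed)
```

Note this times the whole request, so it folds prefill into the denominator; for 500-token completions with a short prompt the error is small.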

Why W4A16 beats W4A4 on GB10

GB10 (SM121) lacks native FP4 activation compute. W4A4 falls back to Marlin dequantization for activations, adding overhead that negates the bandwidth savings. On GPUs with native FP4 compute (B200/B100), W4A4 should be faster. On GB10, W4A16 is the optimal format.

Known Performance Limitation: Triton Attention Fallback

This affects ALL E4B configurations on vLLM (BF16, FP8, NVFP4), not just this checkpoint.

Gemma 4 E4B has heterogeneous attention head dimensions — sliding attention uses head_dim=256 while global attention uses head_dim=512. This forces vLLM to disable FlashAttention and fall back to the significantly slower Triton attention kernel (vLLM issue #38887).

Impact: On RTX 4090, E4B runs at ~9 tok/s vs ~100+ tok/s for a similar-sized Llama 3B — a roughly 11x gap largely attributable to the attention backend. On GB10, the impact is less extreme but still significant.

All benchmark numbers above are measured WITH this Triton fallback. Once PR #38891 (per-layer attention backend selection) is merged, expect a substantial speed improvement across all formats — potentially 30-50% faster.

Track the fix: vLLM issue #38887 and PR #38891.

About the Model

Gemma 4 E4B uses Per-Layer Embedding (PLE) architecture — not MoE, not traditional dense:

  • Effective parameters: 4B (compute per token)
  • Total parameters: 8B (including PLE embedding tables)
  • PLE embedding tables (~5.4 GB in BF16) are large but only used for fast lookups — not quantized
  • 42 decoder layers (36 sliding attention + 6 global attention)
  • Supports: text, vision, audio, 128K context, tool calling

PLE Bandwidth Impact

The PLE tables account for ~36% of the BF16 footprint but consume disproportionate bandwidth because they are accessed every token and remain in BF16. This is why E4B (50 tok/s) is slightly slower than the 26B-A4B MoE (52 tok/s) despite being a much smaller model — MoE only reads 3.8B active parameters per token, while E4B reads 4B compute weights plus PLE lookups from the 5.4 GB tables.
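
The bandwidth-bound intuition can be sketched as a back-of-envelope ceiling: tokens/s cannot exceed memory bandwidth divided by bytes read per token. This sketch counts only the streamed decoder weights — PLE lookups, KV-cache reads, and activations add further traffic, which is why measured numbers sit well below these bounds:

```python
# Back-of-envelope decode ceiling for a memory-bandwidth-bound model:
#   tok/s <= bandwidth / (active params x bytes per weight)
# Counts streamed decoder weights only; PLE, KV cache, and activation
# traffic push real throughput below these upper bounds.
BANDWIDTH_BPS = 273e9  # GB10 unified memory: 273 GB/s

def weight_stream_ceiling(active_params: float, bytes_per_weight: float) -> float:
    """Upper bound on tok/s from weight streaming alone."""
    return BANDWIDTH_BPS / (active_params * bytes_per_weight)

# E4B: 4B compute weights per token at each format's weight width
for name, bpw in [("BF16", 2.0), ("FP8", 1.0), ("NVFP4", 0.5)]:
    print(f"{name}: <= {weight_stream_ceiling(4e9, bpw):.0f} tok/s")
```

The ordering matches the benchmark table: halving weight bytes roughly doubles the ceiling, and the measured BF16 figure (19.2 tok/s) sits under its ~34 tok/s weight-only bound once PLE and KV traffic are accounted for.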

Quantization Details

  • Method: NVFP4A16 (weight-only, 4-bit FP, 16-bit activations)
  • Tool: llm-compressor v0.10.1.dev (git main, post-PR #2561)
  • Data-free: No calibration dataset required
  • Ignored layers: lm_head, vision_tower, audio_tower, embed_vision, embed_audio

Recipe

from transformers import AutoModelForCausalLM

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Load the official BF16 checkpoint
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-E4B-it", torch_dtype="auto"
)

recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4A16",
    ignore=[
        "lm_head",
        "re:.*vision_tower.*",
        "re:.*audio_tower.*",
        "re:.*embed_vision.*",
        "re:.*embed_audio.*",
    ],
)

# Data-free: NVFP4A16 requires no calibration dataset
oneshot(model=model, recipe=recipe)

# Save in compressed-tensors format for vLLM
model.save_pretrained("Gemma-4-E4B-it-NVFP4A16", save_compressed=True)

Installation Note (as of April 2026)

llm-compressor on PyPI (v0.10.0.1) does not support Gemma 4 due to a transformers version conflict. You must install from git main:

pip install 'git+https://github.com/vllm-project/llm-compressor.git@main'
pip install --force-reinstall --no-deps 'transformers>=5.5' 'huggingface_hub>=0.30'
pip install torchvision  # required for Gemma4VideoProcessor

Usage with vLLM

vllm serve coolthor/Gemma-4-E4B-it-NVFP4A16 \
  --quantization compressed-tensors \
  --kv-cache-dtype fp8 \
  --max-model-len 16384

Or with Docker:

docker run -d \
  --gpus all --ipc host --shm-size 32gb \
  -p 8000:8000 \
  -v /path/to/model:/models/gemma4-e4b \
  vllm/vllm-openai:gemma4-cu130 \
  --model /models/gemma4-e4b \
  --served-model-name gemma-4-e4b \
  --quantization compressed-tensors \
  --kv-cache-dtype fp8 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.75 \
  --reasoning-parser gemma4 \
  --enable-auto-tool-choice --tool-call-parser pythonic
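
Once the server is up (either variant above), a minimal chat request needs nothing beyond the Python standard library. A sketch — the endpoint, prompt, and `max_tokens` value are illustrative:

```python
# Minimal chat request against the vLLM OpenAI-compatible endpoint started
# above. Endpoint URL and prompt are illustrative.
import json
import urllib.request

def build_payload(prompt: str) -> dict:
    """Chat-completions request body for the served model."""
    return {
        "model": "coolthor/Gemma-4-E4B-it-NVFP4A16",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

def chat(prompt: str, base_url: str = "http://localhost:8000/v1") -> str:
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

If you pass `--served-model-name gemma-4-e4b` as in the Docker command, use that name in the payload instead of the full checkpoint path.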

Quick Alternative: FP8 Online (no download needed)

If you just want a speed boost without downloading a quantized checkpoint:

vllm serve google/gemma-4-E4B-it \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 16384

This gives 36 tok/s (1.9x over BF16) with zero extra work.

Quality Checks

| Test | Result |
|---|---|
| Chinese (Traditional) | Fluent, correct BPS explanation |
| Math/Reasoning | Put spread max profit/loss calculation correct |
| Code Generation | Black-Scholes implementation complete and correct |
| Structured Output (JSON) | Valid JSON format |
| Repetition Test (count 1-50) | No duplicates, no degradation |
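
The repetition test is easy to script. A minimal checker (hypothetical, not the exact harness used) that verifies a "count 1-50" completion contains each number exactly once and in order:

```python
# Hypothetical repetition-degradation check: given model output for
# "count from 1 to 50", confirm the numbers 1..n appear exactly once
# each and in ascending order.
import re

def check_count(text: str, n: int = 50) -> bool:
    """True iff text contains exactly the integers 1..n, in order."""
    nums = [int(m) for m in re.findall(r"\d+", text)]
    return nums == list(range(1, n + 1))
```

Any duplicated number (a common symptom of quantization-induced repetition loops) or skipped number fails the check.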

Hardware Tested

  • NVIDIA DGX Spark (GB10, SM121, 128GB unified memory, 273 GB/s)
  • vLLM v0.18.2rc1 (gemma4-cu130 image)
