# Gemma 4 E4B-it NVFP4A16
By coolthor — writing about LLM infrastructure, quantization, and local AI deployment on DGX Spark.
First NVFP4 quantization of Google's Gemma 4 E4B-it model: NVFP4A16 (4-bit FP weights, 16-bit activations), produced with llm-compressor from the official BF16 checkpoint `google/gemma-4-E4B-it`.
## Benchmark — NVIDIA DGX Spark (GB10)
Tested on DGX Spark (GB10, 128GB unified memory, 273 GB/s bandwidth):
| Format | GPU memory | tok/s | Speedup vs BF16 |
|---|---|---|---|
| BF16 | 15.0 GB | 19.2 | 1.0x |
| FP8 (online) | 11.4 GB | 36.0 | 1.9x |
| NVFP4A16 | 9.8 GB | 49.9 | 2.6x |
| NVFP4 W4A4 | 10.2 GB | 39.2 | 2.0x |
- Each figure is the mean of 3 runs × 500 tokens; run-to-run variance ±0.1 tok/s
- Long output (1000 tokens): 49.8 tok/s, no degradation
- Concurrent (3 parallel): 52.7 tok/s per request, 158 tok/s aggregate
- No repetition degradation observed
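The single-request figures above can be reproduced with a simple client-side probe against a running vLLM server. A minimal sketch — the `base_url` and served model name in the trailing comments are assumptions; adjust them to your deployment:

```python
# Client-side decode-throughput probe for an OpenAI-compatible
# vLLM server.
import time

def tok_per_s(completion_tokens: int, elapsed_s: float) -> float:
    """Tokens generated divided by wall-clock seconds."""
    return completion_tokens / elapsed_s

def measure(client, model: str, prompt: str, max_tokens: int = 500) -> float:
    """Time one chat completion and return its decode throughput."""
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return tok_per_s(resp.usage.completion_tokens, time.perf_counter() - start)

# With a server running (see the usage sections below):
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
#   runs = [measure(client, "gemma-4-e4b", "Tell me a story.") for _ in range(3)]
#   print(f"{sum(runs) / len(runs):.1f} tok/s")   # mean of 3 runs, as above
```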
### Why W4A16 beats W4A4 on GB10
GB10 (SM121) lacks native FP4 activation compute. W4A4 falls back to Marlin dequantization for activations, adding overhead that negates the bandwidth savings. On GPUs with native FP4 compute (B200/B100), W4A4 should be faster. On GB10, W4A16 is the optimal format.
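A back-of-envelope roofline makes the bandwidth argument concrete. This is a rough sketch that treats decode as purely weight-bandwidth-bound, ignoring PLE lookups, KV-cache traffic, and the Triton attention overhead covered in the next section:

```python
# Memory-bound decode ceiling: tok/s <= bandwidth / bytes_per_token.
GB = 1e9
BANDWIDTH = 273 * GB       # GB10 unified-memory bandwidth
COMPUTE_PARAMS = 4e9       # E4B effective parameters read per token

def decode_ceiling(bytes_per_weight: float) -> float:
    """Upper bound on tok/s if decode only streamed the weights."""
    return BANDWIDTH / (COMPUTE_PARAMS * bytes_per_weight)

print(f"BF16  (2.0 B/param): {decode_ceiling(2.0):6.1f} tok/s ceiling")
print(f"NVFP4 (0.5 B/param): {decode_ceiling(0.5):6.1f} tok/s ceiling")
```

Measured throughput sits well below both ceilings because of the PLE tables and the attention fallback discussed in this card, but the 4x gap between the two ceilings is what quantization is buying.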
## Known Performance Limitation: Triton Attention Fallback
This affects ALL E4B configurations on vLLM (BF16, FP8, NVFP4), not just this checkpoint.
Gemma 4 E4B has heterogeneous attention head dimensions — sliding attention uses head_dim=256 while global attention uses head_dim=512. This forces vLLM to disable FlashAttention and fall back to the significantly slower Triton attention kernel (vLLM issue #38887).
Impact: On RTX 4090, E4B runs at ~9 tok/s vs ~100+ tok/s for a similar-sized Llama 3B — an 11x gap attributable to the attention backend. On GB10, the impact is less extreme but still significant.
All benchmark numbers above are measured WITH this Triton fallback. Once PR #38891 (per-layer attention backend selection) is merged, expect a substantial speed improvement across all formats — potentially 30-50% faster.
Track the fix:
- Issue: vllm-project/vllm#38887
- PR: vllm-project/vllm#38891
## About the Model
Gemma 4 E4B uses Per-Layer Embedding (PLE) architecture — not MoE, not traditional dense:
- Effective parameters: 4B (compute per token)
- Total parameters: 8B (including PLE embedding tables)
- PLE embedding tables (~5.4 GB in BF16) are large but only used for fast lookups — not quantized
- 42 decoder layers (36 sliding attention + 6 global attention)
- Supports: text, vision, audio, 128K context, tool calling
### PLE Bandwidth Impact
The PLE tables account for ~36% of the model's memory footprint and consume a disproportionate share of bandwidth: they are accessed on every token and remain in BF16. This is why E4B (50 tok/s) is slightly slower than the 26B-A4B MoE (52 tok/s) despite being a much smaller model: the MoE reads only its 3.8B active parameters per token, while E4B reads its 4B compute weights plus lookups into the 5.4 GB of BF16 PLE tables.
## Quantization Details
- Method: NVFP4A16 (weight-only, 4-bit FP, 16-bit activations)
- Tool: llm-compressor v0.10.1.dev (git main, post-PR #2561)
- Data-free: No calibration dataset required
- Ignored layers: `lm_head`, `vision_tower`, `audio_tower`, `embed_vision`, `embed_audio`
### Recipe

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForCausalLM

# Load the official BF16 checkpoint (the exact Auto class for this
# multimodal model may differ; adjust if transformers routes it elsewhere).
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-E4B-it", torch_dtype="auto"
)

# Data-free, weight-only NVFP4: lm_head and the multimodal
# towers/embeddings stay unquantized.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4A16",
    ignore=[
        "lm_head",
        "re:.*vision_tower.*",
        "re:.*audio_tower.*",
        "re:.*embed_vision.*",
        "re:.*embed_audio.*",
    ],
)

oneshot(model=model, recipe=recipe)
```
## Installation Note (as of April 2026)

llm-compressor on PyPI (v0.10.0.1) does not support Gemma 4 due to a transformers version conflict. You must install from git main:

```shell
pip install 'git+https://github.com/vllm-project/llm-compressor.git@main'
pip install --force-reinstall --no-deps 'transformers>=5.5' 'huggingface_hub>=0.30'
pip install torchvision  # required for Gemma4VideoProcessor
```
## Usage with vLLM

```shell
vllm serve coolthor/Gemma-4-E4B-it-NVFP4A16 \
  --quantization compressed-tensors \
  --kv-cache-dtype fp8 \
  --max-model-len 16384
```
Or with Docker:
```shell
docker run -d \
  --gpus all --ipc host --shm-size 32gb \
  -p 8000:8000 \
  -v /path/to/model:/models/gemma4-e4b \
  vllm/vllm-openai:gemma4-cu130 \
  --model /models/gemma4-e4b \
  --served-model-name gemma-4-e4b \
  --quantization compressed-tensors \
  --kv-cache-dtype fp8 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.75 \
  --reasoning-parser gemma4 \
  --enable-auto-tool-choice --tool-call-parser pythonic
```
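Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch: the model name matches the `--served-model-name` from the Docker launch above (for the bare `vllm serve` launch, use the repo id instead), and the endpoint/port are assumptions:

```python
# Build a chat request for the server above. Kept as a pure payload
# builder so it can be inspected without a running server.
def chat_payload(model: str, prompt: str, max_tokens: int = 128) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = chat_payload("gemma-4-e4b", "Summarize NVFP4A16 in one sentence.")

# With the openai package installed and the server running:
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
#   resp = client.chat.completions.create(**payload)
#   print(resp.choices[0].message.content)
```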
### Quick Alternative: FP8 Online (no download needed)

If you just want a speed boost without downloading a quantized checkpoint:

```shell
vllm serve google/gemma-4-E4B-it \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 16384
```
This gives 36 tok/s (1.9x over BF16) with zero extra work.
## Quality Checks
| Test | Result |
|---|---|
| Chinese (Traditional) | Fluent, correct BPS explanation |
| Math/Reasoning | Put-spread max profit/loss calculated correctly |
| Code Generation | Black-Scholes implementation complete and correct |
| Structured Output (JSON) | Valid JSON format |
| Repetition Test (count 1-50) | No duplicates, no degradation |
## Hardware Tested
- NVIDIA DGX Spark (GB10, SM121, 128GB unified memory, 273 GB/s)
- vLLM v0.18.2rc1 (gemma4-cu130 image)
## Links
- Base model: google/gemma-4-E4B-it
- Quantization recipe: llm-compressor PR #2561
- Triton fallback issue: vllm-project/vllm#38887
- Full benchmark article: ai-muninn.com