# Gemma-4-E2B-NVFP4A16
NVFP4 quantization of google/gemma-4-E2B: Google's Gemma 4 E2B base (pre-trained) with Per-Layer Embeddings (PLE). 2B effective parameters, multimodal (text + image + audio), 128K context.

W4A16: language model weights in FP4, activations in FP16. Vision and audio towers stay BF16.
See also Gemma-4-E2B-NVFP4 for the full W4A4 variant.
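The W4A16 split can be sketched in a few lines. This is an illustration only: the codebook layout, per-row scale, and shapes below are assumptions for clarity, not the actual NVFP4 storage format or vLLM kernels. The point it shows is that weights are dequantized from 4-bit codes to FP16 on the fly, while activations never leave FP16.

```python
import numpy as np

# Illustrative FP4 (E2M1) magnitude codebook; sign is stored separately here.
CODEBOOK = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float16)

rng = np.random.default_rng(0)

# "Stored" 4-bit form: magnitude index + sign, plus one scale per output row
# (the real format uses per-group scales; one per row keeps the sketch short).
idx = rng.integers(0, 8, size=(32, 64))                      # 3-bit magnitude
sign = rng.choice([-1.0, 1.0], size=(32, 64)).astype(np.float16)
scale = (rng.random((32, 1)) + 0.1).astype(np.float16)

x = rng.standard_normal((4, 64)).astype(np.float16)          # FP16 activations

w = CODEBOOK[idx] * sign * scale                             # dequantize -> FP16
y = x @ w.T                                                  # matmul runs in FP16
```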
## Key Specs
| | Original (BF16) | NVFP4 (this) |
|---|---|---|
| Size on disk | ~10 GB | ~7.5 GB |
| Compression | 1x (baseline) | ~1.3x overall (text layers ~3x, vision/audio stay BF16) |
| Effective parameters | 2B | 2B |
| Architecture | Dense + PLE (Per-Layer Embeddings) | same |
| Context window | 128K tokens | 128K tokens |
| Modalities | Text, Image, Audio | Text, Image, Audio |
## What is PLE?

Unlike the Gemma 4 26B, which uses Mixture-of-Experts (MoE), the E2B uses Per-Layer Embeddings: a learned per-layer specialization mechanism. Each of the 35 decoder layers gets its own 256-dimensional signal, derived from both the token identity (via a second embedding table) and the evolving hidden representation. This is a continuous alternative to discrete MoE routing: no expert selection, no sparsity, just dense computation with layer-specific conditioning.
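The mechanism can be sketched as follows. Only the shapes come from the card (35 layers, 256-d per-layer signal, a second embedding table); the gating and projection wiring are assumptions for illustration, not the actual Gemma layer internals.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden, ple_dim, n_layers = 1000, 64, 256, 35  # hidden size is illustrative

# Second embedding table: one 256-d vector per (token id, layer) pair.
ple_table = rng.standard_normal((vocab, n_layers, ple_dim)) * 0.02
# Hypothetical per-layer projections mixing the PLE signal into the hidden state.
proj_in = rng.standard_normal((n_layers, hidden, ple_dim)) * 0.02
proj_out = rng.standard_normal((n_layers, ple_dim, hidden)) * 0.02

def decoder_layer(h, layer_idx, token_ids):
    sig = ple_table[token_ids, layer_idx]      # signal from token identity (seq, 256)
    gate = np.tanh(h @ proj_in[layer_idx])     # gated by the evolving hidden state
    # Dense, layer-specific conditioning: no routing, no sparsity.
    h = h + (sig * gate) @ proj_out[layer_idx]
    return h  # a real layer would also run attention + MLP here

token_ids = rng.integers(0, vocab, size=8)
h = rng.standard_normal((8, hidden))
for layer in range(n_layers):
    h = decoder_layer(h, layer, token_ids)
```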
## Speed
Tested on DGX Spark (GB10 Blackwell, SM 12.1):
| Metric | NVFP4 |
|---|---|
| Tokens/sec | ~90 tok/s |
| Model load | ~7.5 GB VRAM |
## Serving with vLLM

Requires vLLM with `transformers >= 5.4` (for Gemma 4 architecture support). No patches needed; vanilla vLLM handles E2B NVFP4 directly.
```bash
vllm serve bg-digitalservices/Gemma-4-E2B-NVFP4A16 \
  --quantization modelopt \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.30 \
  --max-model-len 131072 \
  --trust-remote-code
```
### DGX Spark
```bash
VLLM_NVFP4_GEMM_BACKEND=marlin vllm serve bg-digitalservices/Gemma-4-E2B-NVFP4A16 \
  --quantization modelopt \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.10 \
  --max-model-len 131072 \
  --trust-remote-code
```
## Testing
This is a base model. Use completions or chat endpoints:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"bg-digitalservices/Gemma-4-E2B-NVFP4A16","messages":[{"role":"user","content":"Hello!"}],"max_tokens":200}'
```
## Quantization Details
- Method: NVIDIA Model Optimizer (modelopt) v0.43
- Format: NVFP4 (E2M1 weights with per-group FP8 scales, group size 16)
- Calibration: 512 samples from CNN/DailyMail, batch size 4, seq_len 1024
- Excluded from quantization: Vision tower, audio tower, vision/audio projection layers, lm_head (all stay BF16)
- Hardware: NVIDIA DGX Spark (GB10 Blackwell)
- Quantization script: included as `quantize.py`
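The weight-side format can be illustrated with a toy quantizer. This is a simplified sketch, not the modelopt implementation: the per-group scale is kept as a plain float here (the real format stores it in FP8), and bit-packing and kernel details are omitted. It shows the two ingredients named above: the E2M1 value grid and one scale per group of 16 weights.

```python
import numpy as np

# E2M1 (FP4) representable magnitudes; the sign bit gives the negative half.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-E2M1[::-1], E2M1])

def nvfp4_quantize(w, group=16):
    """Quantize a 1-D weight vector to FP4 with one scale per 16-value group."""
    w = w.reshape(-1, group)
    # Map each group's max magnitude onto the largest FP4 value (6.0).
    scale = np.abs(w).max(axis=1, keepdims=True) / 6.0
    scale[scale == 0] = 1.0
    # Round each scaled weight to the nearest representable FP4 value.
    idx = np.abs(w / scale - GRID[:, None, None]).argmin(axis=0)
    return GRID[idx], scale

rng = np.random.default_rng(0)
w = rng.standard_normal(64).astype(np.float32)
q, scale = nvfp4_quantize(w)
w_hat = (q * scale).reshape(-1)       # dequantized reconstruction
err = np.abs(w - w_hat).max()
```

Because the scale is per-group rather than per-tensor, an outlier in one group of 16 does not degrade the resolution of the rest of the layer.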
## Limitations

- Requires vLLM with `transformers >= 5.4`
- Vision/audio towers stay BF16 (~5 GB of the 7.5 GB total); excluded from quantization to preserve multimodal quality
- Community quantization, not an official NVIDIA or Google release
## License

Apache 2.0, inherited from the base model.
## Credits
Quantized by Mario Iseli on an NVIDIA DGX Spark. Built and validated with AI-engineering assistance from Anthropic.
Shout-out to eugr/spark-vllm-docker for the DGX Spark-optimized vLLM build.
mario@marioiseli.com - Buy me a coffee if this makes your Spark go brrrrrr!