Gemma-4-E2B-NVFP4A16

NVFP4 quantization of google/gemma-4-E2B: Google's Gemma 4 E2B base (pre-trained) with Per-Layer Embeddings (PLE). 2B effective parameters, multimodal (text + image + audio), 128K context.

W4A16: language model weights in FP4, activations in FP16. Vision and audio towers stay BF16.

See also Gemma-4-E2B-NVFP4 for the full W4A4 variant.

Key Specs

|  | Original (BF16) | NVFP4 (this) |
|---|---|---|
| Size on disk | ~10 GB | ~7.5 GB |
| Compression | – | ~1.3x (text layers ~3x; vision/audio stay BF16) |
| Effective parameters | 2B | 2B |
| Architecture | Dense + PLE (Per-Layer Embeddings) | same |
| Context window | 128K tokens | 128K tokens |
| Modalities | Text, Image, Audio | Text, Image, Audio |

What is PLE?

Unlike the Gemma 4 26B, which uses Mixture-of-Experts (MoE), the E2B uses Per-Layer Embeddings, a learned per-layer specialization mechanism. Each of the 35 decoder layers gets its own 256-dimensional signal derived from both the token identity (via a second embedding table) and the evolving hidden representation. This is a continuous alternative to discrete MoE routing: no expert selection, no sparsity, just dense computation with layer-specific conditioning.
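The mechanism described above can be sketched in a few lines of NumPy. This is an illustrative toy that assumes only what the paragraph states (35 layers, a 256-d per-layer signal, a second token-embedding table combined with the current hidden state); the specific mixing functions, toy dimensions, and initialization below are hypothetical, not Gemma's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Layer count and signal width from the description above; HIDDEN and VOCAB
# are toy sizes for illustration only.
NUM_LAYERS, PLE_DIM, HIDDEN, VOCAB = 35, 256, 512, 1000

ple_table = rng.standard_normal((VOCAB, NUM_LAYERS, PLE_DIM)) * 0.02  # 2nd embedding table
W_h = rng.standard_normal((NUM_LAYERS, HIDDEN, PLE_DIM)) * 0.02       # hidden -> signal
W_o = rng.standard_normal((NUM_LAYERS, PLE_DIM, HIDDEN)) * 0.02       # signal -> hidden

def ple_condition(token_ids, hidden, layer):
    tok_sig = ple_table[token_ids, layer]   # (seq, 256): from token identity
    hid_sig = hidden @ W_h[layer]           # (seq, 256): from evolving hidden state
    signal = np.tanh(tok_sig + hid_sig)     # dense: every token, no routing/sparsity
    return hidden + signal @ W_o[layer]     # layer-specific conditioning

ids = rng.integers(0, VOCAB, size=8)
h = rng.standard_normal((8, HIDDEN))
out = ple_condition(ids, h, layer=0)
print(out.shape)  # (8, 512)
```

Note the contrast with MoE: every token passes through the same dense computation, and the per-layer signal conditions it continuously instead of selecting among discrete experts.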

Speed

Tested on DGX Spark (GB10 Blackwell, SM 12.1):

| Metric | NVFP4 |
|---|---|
| Tokens/sec | ~90 tok/s |
| Model load | ~7.5 GB VRAM |

Serving with vLLM

Requires vLLM with transformers >= 5.4 (for Gemma 4 architecture support). No patches needed; vanilla vLLM handles E2B NVFP4 directly.

```bash
vllm serve bg-digitalservices/Gemma-4-E2B-NVFP4A16 \
  --quantization modelopt \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.30 \
  --max-model-len 131072 \
  --trust-remote-code
```

DGX Spark

```bash
VLLM_NVFP4_GEMM_BACKEND=marlin vllm serve bg-digitalservices/Gemma-4-E2B-NVFP4A16 \
  --quantization modelopt \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.10 \
  --max-model-len 131072 \
  --trust-remote-code
```

Testing

This is a base (pre-trained) model, so expect raw text continuation rather than instruction-following. Use the completions or chat endpoints:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"bg-digitalservices/Gemma-4-E2B-NVFP4A16","messages":[{"role":"user","content":"Hello!"}],"max_tokens":200}'
```
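For scripted testing, the same request can be issued from Python with only the standard library. The endpoint, model name, and JSON body mirror the curl call above; the helper names and the rough tokens/sec timing are illustrative additions, not part of any official client.

```python
import json
import time
import urllib.request

MODEL = "bg-digitalservices/Gemma-4-E2B-NVFP4A16"
URL = "http://localhost:8000/v1/chat/completions"

def build_payload(prompt, max_tokens=200):
    # Same JSON body as the curl example above.
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt, max_tokens=200):
    req = urllib.request.Request(
        URL,
        data=json.dumps(build_payload(prompt, max_tokens)).encode(),
        headers={"Content-Type": "application/json"},
    )
    t0 = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    elapsed = time.perf_counter() - t0
    text = body["choices"][0]["message"]["content"]
    tok_s = body["usage"]["completion_tokens"] / elapsed  # rough end-to-end rate
    return text, tok_s

if __name__ == "__main__":
    text, tok_s = chat("Hello!")
    print(f"[{tok_s:.1f} tok/s] {text}")
```

The tok/s figure includes request overhead, so it will read slightly below the server's raw decode rate.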

Quantization Details

  • Method: NVIDIA Model Optimizer (modelopt) v0.43
  • Format: NVFP4 β€” E2M1 weights with per-group FP8 scales (group size 16)
  • Calibration: 512 samples from CNN/DailyMail, batch size 4, seq_len 1024
  • Excluded from quantization: Vision tower, audio tower, vision/audio projection layers, lm_head (all stay BF16)
  • Hardware: NVIDIA DGX Spark (GB10 Blackwell)
  • Quantization script: included as quantize.py

Limitations

  • Requires vLLM with transformers >= 5.4
  • Vision/audio towers stay BF16 (β‰ˆ5 GB of the 7.5 GB total) β€” excluded to preserve multimodal quality
  • Community quantization, not an official NVIDIA or Google release

License

Apache 2.0, inherited from the base model.

Credits

Quantized by Mario Iseli on an NVIDIA DGX Spark. Built and validated with AI-engineering assistance from Anthropic.

Shout-out to eugr/spark-vllm-docker for the DGX Spark-optimized vLLM build.

📬 mario@marioiseli.com ☕ Buy me a coffee if this makes your Spark go brrrrrr! 🚀
