Gemma-4-E2B-NVFP4A16

NVFP4 quantization of google/gemma-4-E2B: Google's Gemma 4 E2B base (pre-trained) with Per-Layer Embeddings (PLE). 2B effective parameters, multimodal (text + image + audio), 128K context.

W4A16: language model weights in FP4, activations in FP16. Vision and audio towers stay BF16.

See also Gemma-4-E2B-NVFP4 for the full W4A4 variant.

Key Specs

|  | Original (BF16) | NVFP4 (this) |
|---|---|---|
| Size on disk | ~10 GB | ~7.5 GB |
| Compression | – | ~1.3x (text layers ~3x; vision/audio stay BF16) |
| Effective parameters | 2B | 2B |
| Architecture | Dense + PLE (Per-Layer Embeddings) | same |
| Context window | 128K tokens | 128K tokens |
| Modalities | Text, Image, Audio | Text, Image, Audio |

What is PLE?

Unlike the Gemma 4 26B, which uses Mixture-of-Experts (MoE), the E2B uses Per-Layer Embeddings, a learned per-layer specialization mechanism. Each of the 35 decoder layers gets its own 256-dimensional signal derived from both the token identity (via a second embedding table) and the evolving hidden representation. This is a continuous alternative to discrete MoE routing: no expert selection, no sparsity, just dense computation with layer-specific conditioning.
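The mechanism described above can be sketched in a few lines of NumPy. This is an illustrative toy that assumes only what the paragraph states (35 layers, a 256-d per-layer signal, a second token-embedding table combined with the current hidden state); the specific mixing functions, toy dimensions, and initialization below are hypothetical, not Gemma's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Layer count and signal width from the description above; HIDDEN and VOCAB
# are toy sizes for illustration only.
NUM_LAYERS, PLE_DIM, HIDDEN, VOCAB = 35, 256, 512, 1000

ple_table = rng.standard_normal((VOCAB, NUM_LAYERS, PLE_DIM)) * 0.02  # 2nd embedding table
W_h = rng.standard_normal((NUM_LAYERS, HIDDEN, PLE_DIM)) * 0.02       # hidden -> signal
W_o = rng.standard_normal((NUM_LAYERS, PLE_DIM, HIDDEN)) * 0.02       # signal -> hidden

def ple_condition(token_ids, hidden, layer):
    tok_sig = ple_table[token_ids, layer]   # (seq, 256): from token identity
    hid_sig = hidden @ W_h[layer]           # (seq, 256): from evolving hidden state
    signal = np.tanh(tok_sig + hid_sig)     # dense: every token, no routing/sparsity
    return hidden + signal @ W_o[layer]     # layer-specific conditioning

ids = rng.integers(0, VOCAB, size=8)
h = rng.standard_normal((8, HIDDEN))
out = ple_condition(ids, h, layer=0)
print(out.shape)  # (8, 512)
```

Note the contrast with MoE: every token passes through the same dense computation, and the per-layer signal conditions it continuously instead of selecting among discrete experts.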

Speed

Tested on DGX Spark (GB10 Blackwell, SM 12.1):

| Metric | NVFP4 |
|---|---|
| Tokens/sec | ~90 tok/s |
| Model load | ~7.5 GB VRAM |

Serving with vLLM

Requires vLLM with transformers >= 5.4 (for Gemma 4 architecture support). No patches needed; vanilla vLLM handles E2B NVFP4 directly.

```bash
vllm serve bg-digitalservices/Gemma-4-E2B-NVFP4A16 \
  --quantization modelopt \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.30 \
  --max-model-len 131072 \
  --trust-remote-code
```

DGX Spark

```bash
VLLM_NVFP4_GEMM_BACKEND=marlin vllm serve bg-digitalservices/Gemma-4-E2B-NVFP4A16 \
  --quantization modelopt \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.10 \
  --max-model-len 131072 \
  --trust-remote-code
```

Testing

This is a base (pre-trained) model, so expect raw text continuation rather than instruction-following. Use the completions or chat endpoints:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"bg-digitalservices/Gemma-4-E2B-NVFP4A16","messages":[{"role":"user","content":"Hello!"}],"max_tokens":200}'
```
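For scripted testing, the same request can be issued from Python with only the standard library. The endpoint, model name, and JSON body mirror the curl call above; the helper names and the rough tokens/sec timing are illustrative additions, not part of any official client.

```python
import json
import time
import urllib.request

MODEL = "bg-digitalservices/Gemma-4-E2B-NVFP4A16"
URL = "http://localhost:8000/v1/chat/completions"

def build_payload(prompt, max_tokens=200):
    # Same JSON body as the curl example above.
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt, max_tokens=200):
    req = urllib.request.Request(
        URL,
        data=json.dumps(build_payload(prompt, max_tokens)).encode(),
        headers={"Content-Type": "application/json"},
    )
    t0 = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    elapsed = time.perf_counter() - t0
    text = body["choices"][0]["message"]["content"]
    tok_s = body["usage"]["completion_tokens"] / elapsed  # rough end-to-end rate
    return text, tok_s

if __name__ == "__main__":
    text, tok_s = chat("Hello!")
    print(f"[{tok_s:.1f} tok/s] {text}")
```

The tok/s figure includes request overhead, so it will read slightly below the server's raw decode rate.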

Quantization Details

  • Method: NVIDIA Model Optimizer (modelopt) v0.43
  • Format: NVFP4 β€” E2M1 weights with per-group FP8 scales (group size 16)
  • Calibration: 512 samples from CNN/DailyMail, batch size 4, seq_len 1024
  • Excluded from quantization: Vision tower, audio tower, vision/audio projection layers, lm_head (all stay BF16)
  • Hardware: NVIDIA DGX Spark (GB10 Blackwell)
  • Quantization script: included as quantize.py

Limitations

  • Requires vLLM with transformers >= 5.4
  • Vision/audio towers stay BF16 (β‰ˆ5 GB of the 7.5 GB total) β€” excluded to preserve multimodal quality
  • Community quantization, not an official NVIDIA or Google release

License

Apache 2.0, inherited from the base model.

Credits

Quantized by Mario Iseli on an NVIDIA DGX Spark. Built and validated with AI-engineering assistance from Anthropic.

Shout-out to eugr/spark-vllm-docker for the DGX Spark-optimized vLLM build.

📬 mario@marioiseli.com ☕ Buy me a coffee if this makes your Spark go brrrrrr! 🚀
