# Community resources for Gemma 4 deployment: mobile, local, and cloud paths
Putting together a few things that weren't easy to find in the official
docs, in case it's useful for others getting started.
## Mobile deployment (E2B / E4B)
Android has the clearest official path right now:
- Google AI Edge Gallery: the fastest way to test on-device
- LiteRT-LM: gets E2B running under 1.5 GB RAM on supported devices
- ML Kit GenAI Prompt API: for building your own Android app
iOS is still a developer-only story via the MediaPipe LLM Inference SDK;
there's no consumer app yet.
## Local inference quick reference
| Runtime | How to run |
|---|---|
| Ollama | `ollama pull gemma4:e4b` |
| llama.cpp | standard GGUF load |
| LM Studio | search "gemma4" in the model browser |
| MLX (Apple Silicon) | via Unsloth MLX builds |
Note: if you're hitting OOM on the 31B, the KV cache at long context
is substantial; `--ctx-size 8192 --cache-type-k q4_0` (llama.cpp) helps as a
workaround until better fixes land.
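To see why those flags help, here's a back-of-the-envelope KV cache size estimate. The layer/head/dim numbers below are illustrative assumptions, not the actual Gemma 4 config; the q4_0 figure uses llama.cpp's block layout of 18 bytes per 32 elements (~0.5625 bytes/element).

```python
def kv_cache_bytes(ctx_len, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Rough KV cache size: a K and a V tensor per layer, per position."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Illustrative architecture numbers (assumed, NOT the real Gemma 4 config):
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

fp16 = kv_cache_bytes(8192, LAYERS, KV_HEADS, HEAD_DIM, 2)       # fp16: 2 bytes/elem
q4   = kv_cache_bytes(8192, LAYERS, KV_HEADS, HEAD_DIM, 0.5625)  # q4_0: 18 B / 32 elems

print(f"fp16 KV cache @8k ctx: {fp16 / 2**30:.2f} GiB")  # → 1.50 GiB
print(f"q4_0 KV cache @8k ctx: {q4 / 2**30:.2f} GiB")    # → 0.42 GiB
```

The size is linear in context length, so halving `--ctx-size` and quantizing the cache compound: together they cut this example from ~1.5 GiB to a few hundred MiB.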
## Live playground
If you want to test the 26B A4B without spinning anything up locally,
I put together a playground at gemma4.app; it runs via OpenRouter,
no signup needed.
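If you'd rather hit the same model programmatically, OpenRouter exposes an OpenAI-compatible chat-completions endpoint. The model slug below is an assumption (check OpenRouter's model list for the real one); the request only fires if `OPENROUTER_API_KEY` is set, otherwise it just prints the payload it would send.

```python
import json
import os
import urllib.request

# Assumed slug; look up the actual Gemma 4 identifier on OpenRouter.
MODEL = "google/gemma-4-26b-a4b"

def build_request(prompt, model=MODEL):
    """Build an OpenRouter chat-completions payload (OpenAI-compatible schema)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_request("Summarize quantized KV caches in one sentence.")

api_key = os.environ.get("OPENROUTER_API_KEY")
if api_key:
    # Standard bearer-token POST to the chat-completions endpoint.
    req = urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
else:
    # No key configured: show the payload instead of sending it.
    print(json.dumps(payload, indent=2))
```

Because the schema is OpenAI-compatible, the same payload works with the official OpenAI client libraries if you point `base_url` at OpenRouter.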
Happy to add anything else that's missing here.
Hi @Linncharm,
Thanks for putting this together; it's a clear, practical summary that should be useful for others getting started.
On mobile deployment, your breakdown aligns with the current recommended paths. Android has the most complete end-to-end story today, with AI Edge Gallery for quick on-device validation and the ML Kit GenAI Prompt API for production apps. LiteRT-LM is a good option under tighter memory constraints, with expected variation based on model variant, context length, and quantization. Your iOS note is accurate: the MediaPipe LLM Inference SDK is the primary developer path today.
The local inference section is also a solid quick reference. One addition worth calling out: users should run the latest builds of llama.cpp or MLX, since backend optimisations are rapidly improving performance and memory efficiency.
+1 on the KV cache note: memory usage scales significantly with context length on larger models, so reducing context size or using a lower-precision KV cache is a practical way to avoid OOM issues in constrained environments.
The playground is also a useful addition for quick testing. Hosted environments are a good way to evaluate model behavior before setting up local inference.
Thanks again for sharing this.