# Community resources for Gemma 4 deployment: mobile, local, and cloud paths
Putting together a few things that weren't easy to find in the official
docs, in case it's useful for others getting started.
## Mobile deployment (E2B / E4B)
Android has the clearest official path right now:
- Google AI Edge Gallery: the fastest way to test on-device
- LiteRT-LM: gets E2B running under 1.5 GB RAM on supported devices
- ML Kit GenAI Prompt API: for building your own Android app
iOS is still a developer-only story via the MediaPipe LLM Inference SDK;
there's no consumer app yet.
## Local inference quick reference
| Runtime | How to run |
|---|---|
| Ollama | `ollama pull gemma4:e4b` |
| llama.cpp | standard GGUF load |
| LM Studio | search "gemma4" in the model browser |
| MLX (Apple Silicon) | via Unsloth MLX builds |
Note: if you're hitting OOM on the 31B, the KV cache at long context
is substantial; `--ctx-size 8192 --cache-type-k q4_0` (llama.cpp) helps as a
workaround until better fixes land.
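To see why those flags help, here's a back-of-the-envelope KV cache size estimate. The layer/head/dim numbers below are illustrative assumptions, not the actual Gemma 4 config; the q4_0 figure uses llama.cpp's block layout of 18 bytes per 32 elements (~0.5625 bytes/element).

```python
def kv_cache_bytes(ctx_len, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Rough KV cache size: a K and a V tensor per layer, per position."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Illustrative architecture numbers (assumed, NOT the real Gemma 4 config):
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

fp16 = kv_cache_bytes(8192, LAYERS, KV_HEADS, HEAD_DIM, 2)       # fp16: 2 bytes/elem
q4   = kv_cache_bytes(8192, LAYERS, KV_HEADS, HEAD_DIM, 0.5625)  # q4_0: 18 B / 32 elems

print(f"fp16 KV cache @8k ctx: {fp16 / 2**30:.2f} GiB")  # → 1.50 GiB
print(f"q4_0 KV cache @8k ctx: {q4 / 2**30:.2f} GiB")    # → 0.42 GiB
```

The size is linear in context length, so halving `--ctx-size` and quantizing the cache compound: together they cut this example from ~1.5 GiB to a few hundred MiB.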
## Live playground
If you want to test the 26B A4B without spinning anything up locally,
I put together a playground at gemma4.app; it runs via OpenRouter,
no signup needed.
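If you'd rather hit the same model programmatically, OpenRouter exposes an OpenAI-compatible chat-completions endpoint. The model slug below is an assumption (check OpenRouter's model list for the real one); the request only fires if `OPENROUTER_API_KEY` is set, otherwise it just prints the payload it would send.

```python
import json
import os
import urllib.request

# Assumed slug; look up the actual Gemma 4 identifier on OpenRouter.
MODEL = "google/gemma-4-26b-a4b"

def build_request(prompt, model=MODEL):
    """Build an OpenRouter chat-completions payload (OpenAI-compatible schema)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_request("Summarize quantized KV caches in one sentence.")

api_key = os.environ.get("OPENROUTER_API_KEY")
if api_key:
    # Standard bearer-token POST to the chat-completions endpoint.
    req = urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
else:
    # No key configured: show the payload instead of sending it.
    print(json.dumps(payload, indent=2))
```

Because the schema is OpenAI-compatible, the same payload works with the official OpenAI client libraries if you point `base_url` at OpenRouter.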
Happy to add anything else that's missing here.
Hi @Linncharm,
Thanks for putting this together; it's a clear, practical summary that should be useful for others getting started.
On mobile deployment, your breakdown aligns with the current recommended paths. Android has the most complete end-to-end story today, with AI Edge Gallery for quick on-device validation and the ML Kit GenAI Prompt API for production apps. LiteRT-LM is a good option under tighter memory constraints, with expected variation based on model variant, context length, and quantization. Your iOS note is accurate: the MediaPipe LLM Inference SDK is the primary developer path today.
The local inference section is also a solid quick reference. One addition worth calling out: users should run the latest builds of llama.cpp or MLX, since backend optimisations are rapidly improving performance and memory efficiency.
+1 on the KV cache note: memory usage scales significantly with context length on larger models, so reducing context size or using a lower-precision KV cache is a practical way to avoid OOM issues in constrained environments.
The playground is also a useful addition for quick testing. Hosted environments are a good way to evaluate model behavior before setting up local inference.
Thanks again for sharing this.