Gemma 4 E4B — MLX 4-bit | Tool Calling ✅ | Apple Silicon
The fastest 4B multimodal model on Apple Silicon. Tool calling, TurboQuant 4.6x KV compression, Opus Reasoning LoRA, Ollama ready. 4.86 GB.
Tool Calling ✅ · Built by RavenX AI · Apple Silicon Native
Gemma 4 E4B-it quantized to MLX 4-bit (affine, group_size=64) for Apple Silicon — with the full RavenX AI stack built on top: Opus reasoning fine-tuning, TurboQuant KV cache compression, and Gemini CLI terminal tooling.
4.86 GB. 131K context. Text + vision. Runs on any M-series Mac.
🗂 What's in this stack
| Component | What it does | Link |
|---|---|---|
| This model | Gemma 4 E4B 4-bit MLX — 4.86 GB, 131K ctx | You are here |
| Opus Reasoning LoRA | Adds <think>-tag reasoning, trained on Claude Opus 4.6 traces | ↗ adapter repo |
| Fused version | LoRA baked into weights — no adapter needed | ↗ gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit |
| TurboQuant-MLX | 4.6x KV cache compression — run longer contexts at same RAM | ↗ GitHub |
| Gemini CLI fork | MCP-enabled terminal AI agent with Gemini 3 + 1M ctx | ↗ GitHub |
Model Details
| Property | Value |
|---|---|
| Base model | google/gemma-4-E4B-it |
| Architecture | Gemma4ForConditionalGeneration |
| Parameters | ~4B active |
| Modalities | Text · Vision · Audio |
| Quantization | 4-bit affine, group_size=64 |
| File size | 4.86 GB (down from ~17 GB bf16) |
| Context window | 131,072 tokens |
| Vocab size | 262,144 |
| Hidden size | 2,560 |
| Layers | 42 (35× sliding + 7× full attention) |
| Attention heads | 8 (KV heads: 2) |
| Sliding window | 512 |
| Vision encoder | 768 hidden · 16 layers · patch 16px |
⚡ Performance (Apple Silicon)
| Chip | RAM | Tok/sec (est) |
|---|---|---|
| M4 Max | 128GB | ~55–70 |
| M3 Ultra | 192GB | ~60–80 |
| M3 Pro | 36GB | ~35–50 |
| M2 Pro | 32GB | ~20–30 |
| M1 Air | 16GB | ~12–20 |
Runs entirely on unified memory — no GPU VRAM limits. Full model fits in ~6 GB, leaving 10+ GB for context.
🚀 Quickstart
Install
pip install mlx-lm mlx-vlm
Text generation
from mlx_lm import load, generate
model, tokenizer = load("deadbydawn101/gemma-4-E4B-mlx-4bit")
messages = [{"role": "user", "content": "Explain quantum entanglement simply."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
response = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
Vision (image + text)
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
model, processor = load("deadbydawn101/gemma-4-E4B-mlx-4bit")
messages = [{"role": "user", "content": [
{"type": "image", "image": "https://example.com/photo.jpg"},
{"type": "text", "text": "Describe this image in detail."}
]}]
prompt = apply_chat_template(processor, model.config, messages, add_generation_prompt=True)
response = generate(model, processor, prompt=prompt, max_tokens=512)
CLI
mlx_lm.generate \
--model deadbydawn101/gemma-4-E4B-mlx-4bit \
--prompt "Write a Python function to find all primes below N." \
--max-tokens 512
OpenAI-compatible server
mlx_lm.server --model deadbydawn101/gemma-4-E4B-mlx-4bit --port 8080
Ollama
ollama run hf.co/deadbydawn101/gemma-4-E4B-mlx-4bit
🧠 Opus Reasoning + Claude Code LoRA
Fine-tune behavior with the Opus Reasoning + Claude Code LoRA — trained on Claude Opus 4.6 reasoning traces and real Claude Code tool-use patterns.
What it teaches the model
| Behavior | Source |
|---|---|
| <think> tag chain-of-thought before every answer | Opus 4.6 reasoning traces |
| Multi-step problem decomposition | Crownelius/Opus-4.6-Reasoning-2100x-formatted |
| Tool call patterns (read/write/bash/search loops) | 140 Claude Code session files |
| Structured completion style | SFT on completions only (not memorization) |
Training results
Dataset: 2,054 train · 109 val · SFT completions-only
Hardware: Apple M4 Max 128GB · Peak mem: 7.876 GB
Runtime: ~6 min for 1,000 iterations @ ~190 tok/sec
Iter 10 → 2.277 ← cold start
Iter 20 → 0.097 ← style locked in fast
Iter 50 → 0.00063
Iter 100 → 0.0000398
Iter 200 → 0.0000067 (checkpoint)
Iter 1000 → ~3.5e-7 (final)
Loss collapsed early and hard — the Opus reasoning patterns transferred cleanly to Gemma 4's hybrid attention architecture.
Apply the LoRA
from mlx_vlm import load, generate
model, processor = load(
"deadbydawn101/gemma-4-E4B-mlx-4bit",
adapter_path="deadbydawn101/gemma-4-E4B-opus-reasoning-claude-code-lora",
)
Or use the fused model (no adapter needed)
# LoRA baked directly into weights
mlx_lm.generate --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --prompt "..."
→ gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit
⚡ TurboQuant-MLX — 4.6x KV Cache Compression
TurboQuant-MLX is a RavenX AI project that compresses the KV cache using PolarQuant (rotation-based quantization) + QJL (1-bit residual correction) — enabling dramatically longer contexts at the same memory budget.
| | Without TurboQuant | With TurboQuant |
|---|---|---|
| Context @ same RAM | 8K | 36K |
| KV cache growth | Linear | Compressed |
| Accuracy impact | — | Near-zero |
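As a quick sanity check on the table (using round context figures; the exact numbers depend on workload), the context gain at fixed RAM tracks the quoted compression ratio:

```python
# Context gain at the same RAM budget roughly matches the quoted
# 4.6x KV compression ratio (8K -> 36K from the table above).
baseline_ctx, compressed_ctx = 8_000, 36_000
ratio = compressed_ctx / baseline_ctx
print(f"~{ratio:.1f}x more context at the same RAM")
```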
from turboquant_mlx.mlx_kvcache import TurboQuantKVCache
import mlx_lm.models.cache as cache_module
# Drop-in patch — one line before loading
cache_module.make_prompt_cache = lambda model, **kw: [
TurboQuantKVCache() for _ in range(len(model.layers))
]
from mlx_vlm import load, generate
model, processor = load("deadbydawn101/gemma-4-E4B-mlx-4bit")
# Context is now compressed automatically — run it as normal
→ TurboQuant-MLX on GitHub · Release v2.0
🏗 Architecture
Gemma 4 uses hybrid sliding/full attention:
- 35× sliding attention (window=512) — O(n) local context, fast
- 7× full attention — global coherence at regular intervals
This gives near-linear memory scaling for long sequences while maintaining full document coherence — ideal for the TurboQuant + long-context use case.
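To see why this layout scales well, here is a back-of-envelope KV-cache estimate for the layer mix above. The head dimension and fp16 cache entries are assumptions (the card does not state head_dim); only the 7 full-attention layers grow with context, while the 35 sliding layers cap at their 512-token window:

```python
# Rough KV cache estimate for Gemma 4's hybrid attention layout.
# HEAD_DIM is an assumption; the card lists 2 KV heads, 35 sliding
# layers (window 512), and 7 full-attention layers.
LAYERS_SLIDING, LAYERS_FULL = 35, 7
KV_HEADS, HEAD_DIM = 2, 256       # head_dim assumed
WINDOW = 512
BYTES = 2                         # fp16 cache entries

def kv_bytes(context_tokens: int) -> int:
    per_token = 2 * KV_HEADS * HEAD_DIM * BYTES        # K + V, per layer
    sliding = LAYERS_SLIDING * min(context_tokens, WINDOW) * per_token
    full = LAYERS_FULL * context_tokens * per_token
    return sliding + full

print(f"{kv_bytes(131_072) / 2**30:.2f} GiB at full 131K context")
```

Under these assumptions, even the full 131K context costs only a couple of GiB of cache, since the sliding layers contribute a fixed ~37 MB regardless of sequence length.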
💻 Gemini CLI — Coding Agent + Tool Orchestration
We use RavenX AI's Gemini CLI fork as the coding agent and tool orchestration layer on top of these models. This is what makes the tool-calling capability real in production.
Gemini CLI gives you a full agentic loop in the terminal — Google Search grounding, file read/write, shell execution, web fetching, and MCP server support — all wired to a 1M token context window.
# Install
npm install -g @google/gemini-cli
# Run as a coding agent against this model (via local mlx_lm server)
mlx_lm.server --model deadbydawn101/gemma-4-E4B-mlx-4bit --port 8080 &
gemini --baseUrl http://localhost:8080
# Or use directly against the Gemini API (free tier: 60 req/min · 1,000 req/day)
gemini
What Gemini CLI + these models unlock together
| Capability | How |
|---|---|
| Code generation | Gemini CLI reads your codebase, model reasons with <think> tags |
| Tool calling | Native <|tool> tokens → Gemini CLI executes shell/file/web tools |
| Long context | 1M ctx in CLI + TurboQuant 4.6x KV compression = very long sessions |
| MCP servers | Connect any MCP server — databases, APIs, custom tools |
| Search grounding | Google Search built in — model gets live data |
# Real example: code review with tool calling enabled
gemini --baseUrl http://localhost:8080 \
"Review all Python files in ./src, find potential bugs, and suggest fixes"
# Gemini CLI will: read files → call tools → model reasons → produce structured output
→ DeadByDawn101/gemini-cli on GitHub — Apache 2.0, free tier, MCP-compatible
🛠️ Tool Calling (Function Calling)
Gemma 4 has tool calling built natively into its chat template, implemented with <|tool>, <|tool_call>, and <|tool_response> special tokens; most small open models on Hugging Face lack this.
Define tools and call them
from mlx_lm import load, generate
import json
model, tokenizer = load("deadbydawn101/gemma-4-E4B-mlx-4bit")
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City and country"},
"units": {"type": "string", "enum": ["celsius", "fahrenheit"]}
},
"required": ["location"]
}
}
}
]
messages = [{"role": "user", "content": "What's the weather in San Jose, CA?"}]
prompt = tokenizer.apply_chat_template(
messages,
tools=tools,
add_generation_prompt=True,
tokenize=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
# Model responds with a structured tool_call in <|tool_call>...<tool_call|> format
Parse tool calls and feed results back
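A minimal parsing sketch, assuming the `<|tool_call>call:name{args}<tool_call|>` format shown in the token table below. Real model output may vary slightly, and this regex does not handle nested braces in the arguments:

```python
import json
import re

# Extract a single tool call from raw model output.
# Format assumed from this card's token table; not an official parser.
def parse_tool_call(text: str):
    m = re.search(r"<\|tool_call>call:(\w+)(\{.*?\})<tool_call\|>", text, re.S)
    if not m:
        return None
    return {"name": m.group(1), "arguments": json.loads(m.group(2))}

out = '<|tool_call>call:get_weather{"location": "San Jose, CA"}<tool_call|>'
print(parse_tool_call(out))
```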
# After tool execution, feed the result back
messages += [
{"role": "assistant", "tool_calls": [{"function": {"name": "get_weather", "arguments": {"location": "San Jose, CA"}}}]},
{"role": "tool", "tool_responses": [{"name": "get_weather", "response": {"temp": 72, "condition": "sunny"}}]}
]
prompt = tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, tokenize=False)
final = generate(model, tokenizer, prompt=prompt, max_tokens=256)
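The tool execution itself can be a simple name-to-function dispatch. This is a hypothetical sketch: `get_weather` here is a stand-in matching the schema defined above, not a real weather API:

```python
# Hypothetical dispatcher: route a parsed tool call to a local function.
def get_weather(location: str, units: str = "celsius") -> dict:
    # Stand-in implementation for the get_weather schema above.
    return {"temp": 22, "condition": "sunny", "units": units}

TOOLS = {"get_weather": get_weather}

def dispatch(call: dict) -> dict:
    return TOOLS[call["name"]](**call["arguments"])

result = dispatch({"name": "get_weather",
                   "arguments": {"location": "San Jose, CA"}})
print(result)
```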
With mlx_vlm (multimodal + tools)
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
model, processor = load("deadbydawn101/gemma-4-E4B-mlx-4bit")
prompt = apply_chat_template(
processor, model.config, messages,
tools=tools, add_generation_prompt=True
)
Tool token format (native)
| Token | Purpose |
|---|---|
| <|tool>...<tool|> | Tool definition block |
| <|tool_call>call:name{args}<tool_call|> | Model calls a tool |
| <|tool_response>...<tool_response|> | Result returned to model |
🦙 Ollama — One-Command Setup
Instant run (no install needed)
ollama run hf.co/deadbydawn101/gemma-4-E4B-mlx-4bit
With a custom system prompt + tool support
Create a Modelfile:
FROM hf.co/deadbydawn101/gemma-4-E4B-mlx-4bit
SYSTEM "You are a helpful assistant with tool-use capabilities. Think through problems step by step."
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
ollama create ravenx-gemma4 -f Modelfile
ollama run ravenx-gemma4
OpenAI-compatible endpoint
# Ollama exposes an OpenAI-compatible API automatically
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "hf.co/deadbydawn101/gemma-4-E4B-mlx-4bit",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Run with mlx_lm server (native, faster on Apple Silicon)
# mlx_lm.server is typically faster than Ollama on Apple Silicon (MLX targets Metal directly)
mlx_lm.server --model deadbydawn101/gemma-4-E4B-mlx-4bit --port 8080
# Then use any OpenAI client
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "deadbydawn101/gemma-4-E4B-mlx-4bit", "messages": [{"role": "user", "content": "Hello!"}]}'
🔧 Conversion Details
| Step | Detail |
|---|---|
| Source | google/gemma-4-E4B-it (bfloat16, ~17 GB) |
| Tool | mlx_vlm.convert --q-bits 4 --q-group-size 64 --q-mode affine |
| Platform | Apple M4 Max 128GB |
| Output | 4.86 GB · ~4.8 bits/weight · 3 shards |
| LoRA training | mlx_vlm.lora SFT · rank=8 · alpha=16 · 1k iters |
| LoRA fusion | mlx_lm fuse — baked into ravenx-opus variant |
📦 Full RavenX Model Collection
| Model | Size | Description |
|---|---|---|
| gemma-4-E4B-mlx-4bit | 4.86 GB | This model — clean 4-bit E4B base |
| gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit | ~4.86 GB | Fused: base + Opus reasoning LoRA baked in |
| gemma-4-E4B-opus-reasoning-claude-code-lora | 658 MB | LoRA adapter only |
| gemma-4-E2B-Heretic-Uncensored-mlx-4bit | 3.34 GB | 2B abliterated (uncensored) |
| gemma-4-21b-REAP-Tool-Calling-mlx-4bit | 12 GB | 21B REAP-pruned MoE |
License
Gemma Terms of Use — free for research and commercial use with attribution.
TriAttention KV Compression
[2026-04-09] Our MLX port was merged into TriAttention (MIT + NVIDIA) — PR #1 by @DeadByDawn101 (RavenX AI).
Apply a 10.7x KV memory reduction and a 2.5x throughput gain on top of TurboQuant's KV cache compression, for roughly 50x combined KV compression vs. an uncompressed fp16 cache:
from mlx_lm import load
from triattention.mlx import apply_triattention_mlx
model, tokenizer = load("deadbydawn101/gemma-4-E4B-mlx-4bit")
apply_triattention_mlx(model, kv_budget=2048)
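The ~50x combined figure reads as the product of the two KV compression stages. A quick check, assuming the ratios compose multiplicatively (real workloads will only approximate this):

```python
# Combined KV compression if the two stages compose multiplicatively.
turboquant_ratio, triattention_ratio = 4.6, 10.7
combined = turboquant_ratio * triattention_ratio
print(f"~{combined:.0f}x combined KV compression vs fp16")
```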
RavenX Inference Harness
One-command inference, benchmarking, and local OpenAI-compatible server:
git clone https://github.com/DeadByDawn101/ravenx-inference-harness
cd ravenx-inference-harness
# Inference
python run.py --model deadbydawn101/gemma-4-E4B-mlx-4bit --prompt "Your prompt"
# TriAttention compressed
python run.py --model deadbydawn101/gemma-4-E4B-mlx-4bit --triattention --kv-budget 2048
# Local OpenAI-compatible server (works with OpenClaw)
python serve.py --model deadbydawn101/gemma-4-E4B-mlx-4bit --triattention