Gemma 4 E4B — MLX 4-bit | Tool Calling ✅ | Apple Silicon
The fastest 4B multimodal model on Apple Silicon. Tool calling, TurboQuant 4.6x KV compression, Opus Reasoning LoRA, Ollama ready. 4.86 GB.
Tool Calling ✅ · Built by RavenX AI · Apple Silicon Native
Gemma 4 E4B-it quantized to MLX 4-bit (affine, group_size=64) for Apple Silicon — with the full RavenX AI stack built on top: Opus reasoning fine-tuning, TurboQuant KV cache compression, and Gemini CLI terminal tooling.
4.86 GB. 131K context. Text + vision. Runs on any M-series Mac.
🗂 What's in this stack
| Component | What it does | Link |
|---|---|---|
| This model | Gemma 4 E4B 4-bit MLX — 4.86 GB, 131K ctx | You are here |
| Opus Reasoning LoRA | Adds <think>-tag reasoning, trained on Claude Opus 4.6 traces | ↗ adapter repo |
| Fused version | LoRA baked into weights — no adapter needed | ↗ gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit |
| TurboQuant-MLX | 4.6x KV cache compression — run longer contexts at same RAM | ↗ GitHub |
| Gemini CLI fork | MCP-enabled terminal AI agent with Gemini 3 + 1M ctx | ↗ GitHub |
Model Details
| Property | Value |
|---|---|
| Base model | google/gemma-4-E4B-it |
| Architecture | Gemma4ForConditionalGeneration |
| Parameters | ~4B active |
| Modalities | Text · Vision · Audio |
| Quantization | 4-bit affine, group_size=64 |
| File size | 4.86 GB (down from ~17 GB bf16) |
| Context window | 131,072 tokens |
| Vocab size | 262,144 |
| Hidden size | 2,560 |
| Layers | 42 (35× sliding + 7× full attention) |
| Attention heads | 8 (KV heads: 2) |
| Sliding window | 512 |
| Vision encoder | 768 hidden · 16 layers · patch 16px |
⚡ Performance (Apple Silicon)
| Chip | RAM | Tok/sec (est) |
|---|---|---|
| M4 Max | 128GB | ~55–70 |
| M3 Ultra | 192GB | ~60–80 |
| M3 Pro | 36GB | ~35–50 |
| M2 Pro | 32GB | ~20–30 |
| M1 Air | 16GB | ~12–20 |
Runs entirely on unified memory — no GPU VRAM limits. Full model fits in ~6 GB, leaving 10+ GB for context.
🚀 Quickstart
Install
pip install mlx-lm mlx-vlm
Text generation
from mlx_lm import load, generate
model, tokenizer = load("deadbydawn101/gemma-4-E4B-mlx-4bit")
messages = [{"role": "user", "content": "Explain quantum entanglement simply."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
response = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
Vision (image + text)
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
model, processor = load("deadbydawn101/gemma-4-E4B-mlx-4bit")
messages = [{"role": "user", "content": [
{"type": "image", "image": "https://example.com/photo.jpg"},
{"type": "text", "text": "Describe this image in detail."}
]}]
prompt = apply_chat_template(processor, model.config, messages, add_generation_prompt=True)
response = generate(model, processor, prompt=prompt, max_tokens=512)
CLI
mlx_lm.generate \
--model deadbydawn101/gemma-4-E4B-mlx-4bit \
--prompt "Write a Python function to find all primes below N." \
--max-tokens 512
OpenAI-compatible server
mlx_lm.server --model deadbydawn101/gemma-4-E4B-mlx-4bit --port 8080
Ollama
ollama run hf.co/deadbydawn101/gemma-4-E4B-mlx-4bit
🧠 Opus Reasoning + Claude Code LoRA
Fine-tune behavior with the Opus Reasoning + Claude Code LoRA — trained on Claude Opus 4.6 reasoning traces and real Claude Code tool-use patterns.
What it teaches the model
| Behavior | Source |
|---|---|
| <think> tag chain-of-thought before every answer | Opus 4.6 reasoning traces |
| Multi-step problem decomposition | Crownelius/Opus-4.6-Reasoning-2100x-formatted |
| Tool call patterns (read/write/bash/search loops) | 140 Claude Code session files |
| Structured completion style | SFT on completions only (not memorization) |
Training results
Dataset: 2,054 train · 109 val · SFT completions-only
Hardware: Apple M4 Max 128GB · Peak mem: 7.876 GB
Runtime: ~6 min for 1,000 iterations @ ~190 tok/sec
Iter 10 → 2.277 ← cold start
Iter 20 → 0.097 ← style locked in fast
Iter 50 → 0.00063
Iter 100 → 0.0000398
Iter 200 → 0.0000067 (checkpoint)
Iter 1000 → ~3.5e-7 (final)
Loss collapsed early and hard — the Opus reasoning patterns transferred cleanly to Gemma 4's hybrid attention architecture.
Apply the LoRA
from mlx_vlm import load, generate
model, processor = load(
"deadbydawn101/gemma-4-E4B-mlx-4bit",
adapter_path="deadbydawn101/gemma-4-E4B-opus-reasoning-claude-code-lora",
)
Or use the fused model (no adapter needed)
# LoRA baked directly into weights
mlx_lm.generate --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --prompt "..."
→ gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit
⚡ TurboQuant-MLX — 4.6x KV Cache Compression
TurboQuant-MLX is a RavenX AI project that compresses the KV cache using PolarQuant (rotation-based quantization) + QJL (1-bit residual correction) — enabling dramatically longer contexts at the same memory budget.
| | Without TurboQuant | With TurboQuant |
|---|---|---|
| Context @ same RAM | 8K | 36K |
| KV cache growth | Linear | Compressed |
| Accuracy impact | — | Near-zero |
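As a quick sanity check on the table (using round context figures; the exact numbers depend on workload), the context gain at fixed RAM tracks the quoted compression ratio:

```python
# Context gain at the same RAM budget roughly matches the quoted
# 4.6x KV compression ratio (8K -> 36K from the table above).
baseline_ctx, compressed_ctx = 8_000, 36_000
ratio = compressed_ctx / baseline_ctx
print(f"~{ratio:.1f}x more context at the same RAM")
```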
from turboquant_mlx.mlx_kvcache import TurboQuantKVCache
import mlx_lm.models.cache as cache_module
# Drop-in patch — one line before loading
cache_module.make_prompt_cache = lambda model, **kw: [
TurboQuantKVCache() for _ in range(len(model.layers))
]
from mlx_vlm import load, generate
model, processor = load("deadbydawn101/gemma-4-E4B-mlx-4bit")
# Context is now compressed automatically — run it as normal
→ TurboQuant-MLX on GitHub · Release v2.0
🏗 Architecture
Gemma 4 uses hybrid sliding/full attention:
- 35× sliding attention (window=512) — O(n) local context, fast
- 7× full attention — global coherence at regular intervals
This gives near-linear memory scaling for long sequences while maintaining full document coherence — ideal for the TurboQuant + long-context use case.
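To see why this layout scales well, here is a back-of-envelope KV-cache estimate for the layer mix above. The head dimension and fp16 cache entries are assumptions (the card does not state head_dim); only the 7 full-attention layers grow with context, while the 35 sliding layers cap at their 512-token window:

```python
# Rough KV cache estimate for Gemma 4's hybrid attention layout.
# HEAD_DIM is an assumption; the card lists 2 KV heads, 35 sliding
# layers (window 512), and 7 full-attention layers.
LAYERS_SLIDING, LAYERS_FULL = 35, 7
KV_HEADS, HEAD_DIM = 2, 256       # head_dim assumed
WINDOW = 512
BYTES = 2                         # fp16 cache entries

def kv_bytes(context_tokens: int) -> int:
    per_token = 2 * KV_HEADS * HEAD_DIM * BYTES        # K + V, per layer
    sliding = LAYERS_SLIDING * min(context_tokens, WINDOW) * per_token
    full = LAYERS_FULL * context_tokens * per_token
    return sliding + full

print(f"{kv_bytes(131_072) / 2**30:.2f} GiB at full 131K context")
```

Under these assumptions, even the full 131K context costs only a couple of GiB of cache, since the sliding layers contribute a fixed ~37 MB regardless of sequence length.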
💻 Gemini CLI — Coding Agent + Tool Orchestration
We use RavenX AI's Gemini CLI fork as the coding agent and tool orchestration layer on top of these models. This is what makes the tool-calling capability real in production.
Gemini CLI gives you a full agentic loop in the terminal — Google Search grounding, file read/write, shell execution, web fetching, and MCP server support — all wired to a 1M token context window.
# Install
npm install -g @google/gemini-cli
# Run as a coding agent against this model (via local mlx_lm server)
mlx_lm.server --model deadbydawn101/gemma-4-E4B-mlx-4bit --port 8080 &
gemini --baseUrl http://localhost:8080
# Or use directly against the Gemini API (free tier: 60 req/min · 1,000 req/day)
gemini
What Gemini CLI + these models unlock together
| Capability | How |
|---|---|
| Code generation | Gemini CLI reads your codebase, model reasons with <think> tags |
| Tool calling | Native <|tool> tokens → Gemini CLI executes shell/file/web tools |
| Long context | 1M ctx in CLI + TurboQuant 4.6x KV compression = very long sessions |
| MCP servers | Connect any MCP server — databases, APIs, custom tools |
| Search grounding | Google Search built in — model gets live data |
# Real example: code review with tool calling enabled
gemini --baseUrl http://localhost:8080 \
"Review all Python files in ./src, find potential bugs, and suggest fixes"
# Gemini CLI will: read files → call tools → model reasons → produce structured output
→ DeadByDawn101/gemini-cli on GitHub — Apache 2.0, free tier, MCP-compatible
🛠️ Tool Calling (Function Calling)
Gemma 4 has tool calling built natively into its chat template, implemented with <|tool>, <|tool_call>, and <|tool_response> special tokens; most small open models on Hugging Face lack this.
Define tools and call them
from mlx_lm import load, generate
import json
model, tokenizer = load("deadbydawn101/gemma-4-E4B-mlx-4bit")
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City and country"},
"units": {"type": "string", "enum": ["celsius", "fahrenheit"]}
},
"required": ["location"]
}
}
}
]
messages = [{"role": "user", "content": "What's the weather in San Jose, CA?"}]
prompt = tokenizer.apply_chat_template(
messages,
tools=tools,
add_generation_prompt=True,
tokenize=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
# Model responds with a structured tool_call in <|tool_call>...<tool_call|> format
Parse tool calls and feed results back
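A minimal parsing sketch, assuming the `<|tool_call>call:name{args}<tool_call|>` format shown in the token table below. Real model output may vary slightly, and this regex does not handle nested braces in the arguments:

```python
import json
import re

# Extract a single tool call from raw model output.
# Format assumed from this card's token table; not an official parser.
def parse_tool_call(text: str):
    m = re.search(r"<\|tool_call>call:(\w+)(\{.*?\})<tool_call\|>", text, re.S)
    if not m:
        return None
    return {"name": m.group(1), "arguments": json.loads(m.group(2))}

out = '<|tool_call>call:get_weather{"location": "San Jose, CA"}<tool_call|>'
print(parse_tool_call(out))
```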
# After tool execution, feed the result back
messages += [
{"role": "assistant", "tool_calls": [{"function": {"name": "get_weather", "arguments": {"location": "San Jose, CA"}}}]},
{"role": "tool", "tool_responses": [{"name": "get_weather", "response": {"temp": 72, "condition": "sunny"}}]}
]
prompt = tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, tokenize=False)
final = generate(model, tokenizer, prompt=prompt, max_tokens=256)
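The tool execution itself can be a simple name-to-function dispatch. This is a hypothetical sketch: `get_weather` here is a stand-in matching the schema defined above, not a real weather API:

```python
# Hypothetical dispatcher: route a parsed tool call to a local function.
def get_weather(location: str, units: str = "celsius") -> dict:
    # Stand-in implementation for the get_weather schema above.
    return {"temp": 22, "condition": "sunny", "units": units}

TOOLS = {"get_weather": get_weather}

def dispatch(call: dict) -> dict:
    return TOOLS[call["name"]](**call["arguments"])

result = dispatch({"name": "get_weather",
                   "arguments": {"location": "San Jose, CA"}})
print(result)
```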
With mlx_vlm (multimodal + tools)
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
model, processor = load("deadbydawn101/gemma-4-E4B-mlx-4bit")
prompt = apply_chat_template(
processor, model.config, messages,
tools=tools, add_generation_prompt=True
)
Tool token format (native)
| Token | Purpose |
|---|---|
| <|tool>...<tool|> | Tool definition block |
| <|tool_call>call:name{args}<tool_call|> | Model calls a tool |
| <|tool_response>...<tool_response|> | Result returned to model |
🦙 Ollama — One-Command Setup
Instant run (no install needed)
ollama run hf.co/deadbydawn101/gemma-4-E4B-mlx-4bit
With a custom system prompt + tool support
Create a Modelfile:
FROM hf.co/deadbydawn101/gemma-4-E4B-mlx-4bit
SYSTEM "You are a helpful assistant with tool-use capabilities. Think through problems step by step."
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
ollama create ravenx-gemma4 -f Modelfile
ollama run ravenx-gemma4
OpenAI-compatible endpoint
# Ollama exposes an OpenAI-compatible API automatically
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "hf.co/deadbydawn101/gemma-4-E4B-mlx-4bit",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Run with mlx_lm server (native, faster on Apple Silicon)
# mlx_lm.server is typically faster than Ollama on Apple Silicon (MLX targets Metal directly)
mlx_lm.server --model deadbydawn101/gemma-4-E4B-mlx-4bit --port 8080
# Then use any OpenAI client
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "deadbydawn101/gemma-4-E4B-mlx-4bit", "messages": [{"role": "user", "content": "Hello!"}]}'
🔧 Conversion Details
| Step | Detail |
|---|---|
| Source | google/gemma-4-E4B-it (bfloat16, ~17 GB) |
| Tool | mlx_vlm.convert --q-bits 4 --q-group-size 64 --q-mode affine |
| Platform | Apple M4 Max 128GB |
| Output | 4.86 GB · ~4.8 bits/weight · 3 shards |
| LoRA training | mlx_vlm.lora SFT · rank=8 · alpha=16 · 1k iters |
| LoRA fusion | mlx_lm fuse — baked into ravenx-opus variant |
📦 Full RavenX Model Collection
| Model | Size | Description |
|---|---|---|
| gemma-4-E4B-mlx-4bit | 4.86 GB | This model — clean 4-bit E4B base |
| gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit | ~4.86 GB | Fused: base + Opus reasoning LoRA baked in |
| gemma-4-E4B-opus-reasoning-claude-code-lora | 658 MB | LoRA adapter only |
| gemma-4-E2B-Heretic-Uncensored-mlx-4bit | 3.34 GB | 2B abliterated (uncensored) |
| gemma-4-21b-REAP-Tool-Calling-mlx-4bit | 12 GB | 21B REAP-pruned MoE |
License
Gemma Terms of Use — free for research and commercial use with attribution.
TriAttention KV Compression
[2026-04-09] Our MLX port was merged into TriAttention (MIT + NVIDIA) — PR #1 by @DeadByDawn101 (RavenX AI).
Apply a 10.7x KV memory reduction and a 2.5x throughput gain on top of TurboQuant's KV cache compression, for roughly 50x combined KV compression vs. an uncompressed fp16 cache:
from mlx_lm import load
from triattention.mlx import apply_triattention_mlx
model, tokenizer = load("deadbydawn101/gemma-4-E4B-mlx-4bit")
apply_triattention_mlx(model, kv_budget=2048)
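The ~50x combined figure reads as the product of the two KV compression stages. A quick check, assuming the ratios compose multiplicatively (real workloads will only approximate this):

```python
# Combined KV compression if the two stages compose multiplicatively.
turboquant_ratio, triattention_ratio = 4.6, 10.7
combined = turboquant_ratio * triattention_ratio
print(f"~{combined:.0f}x combined KV compression vs fp16")
```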
RavenX Inference Harness
One-command inference, benchmarking, and local OpenAI-compatible server:
git clone https://github.com/DeadByDawn101/ravenx-inference-harness
cd ravenx-inference-harness
# Inference
python run.py --model deadbydawn101/gemma-4-E4B-mlx-4bit --prompt "Your prompt"
# TriAttention compressed
python run.py --model deadbydawn101/gemma-4-E4B-mlx-4bit --triattention --kv-budget 2048
# Local OpenAI-compatible server (works with OpenClaw)
python serve.py --model deadbydawn101/gemma-4-E4B-mlx-4bit --triattention