Gemma 4 E4B — MLX 4-bit | Tool Calling ✅ | Apple Silicon

The fastest 4B multimodal model on Apple Silicon. Tool calling, TurboQuant 4.6x KV compression, Opus Reasoning LoRA, Ollama ready. 4.86 GB.

Tool Calling ✅ · Built by RavenX AI · Apple Silicon Native



Gemma 4 E4B-it quantized to MLX 4-bit (affine, group_size=64) for Apple Silicon — with the full RavenX AI stack built on top: Opus reasoning fine-tuning, TurboQuant KV cache compression, and Gemini CLI terminal tooling.

4.86 GB. 131K context. Text + vision. Runs on any M-series Mac.


🗂 What's in this stack

| Component | What it does | Link |
|---|---|---|
| This model | Gemma 4 E4B 4-bit MLX — 4.86 GB, 131K ctx | You are here |
| Opus Reasoning LoRA | Adds `<think>`-tag reasoning, trained on Claude Opus 4.6 traces | ↗ adapter repo |
| Fused version | LoRA baked into weights — no adapter needed | ↗ gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit |
| TurboQuant-MLX | 4.6x KV cache compression — run longer contexts at the same RAM | ↗ GitHub |
| Gemini CLI fork | MCP-enabled terminal AI agent with Gemini 3 + 1M ctx | ↗ GitHub |

Model Details

| Property | Value |
|---|---|
| Base model | google/gemma-4-E4B-it |
| Architecture | Gemma4ForConditionalGeneration |
| Parameters | ~4B active |
| Modalities | Text · Vision · Audio |
| Quantization | 4-bit affine, group_size=64 |
| File size | 4.86 GB (down from ~17 GB bf16) |
| Context window | 131,072 tokens |
| Vocab size | 262,144 |
| Hidden size | 2,560 |
| Layers | 42 (35× sliding + 7× full attention) |
| Attention heads | 8 (KV heads: 2) |
| Sliding window | 512 |
| Vision encoder | 768 hidden · 16 layers · 16px patches |

⚡ Performance (Apple Silicon)

| Chip | RAM | Tok/sec (est.) |
|---|---|---|
| M4 Max | 128 GB | ~55–70 |
| M3 Ultra | 192 GB | ~60–80 |
| M3 Pro | 36 GB | ~35–50 |
| M2 Pro | 32 GB | ~20–30 |
| M1 Air | 16 GB | ~12–20 |

Runs entirely in unified memory — no GPU VRAM limits. The full model fits in ~6 GB, leaving 10+ GB free for context even on a 16 GB machine.


🚀 Quickstart

Install

pip install mlx-lm mlx-vlm

Text generation

from mlx_lm import load, generate

model, tokenizer = load("deadbydawn101/gemma-4-E4B-mlx-4bit")
messages = [{"role": "user", "content": "Explain quantum entanglement simply."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
response = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)

Vision (image + text)

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("deadbydawn101/gemma-4-E4B-mlx-4bit")
messages = [{"role": "user", "content": [
    {"type": "image", "image": "https://example.com/photo.jpg"},
    {"type": "text",  "text": "Describe this image in detail."}
]}]
prompt = apply_chat_template(processor, model.config, messages, add_generation_prompt=True)
response = generate(model, processor, prompt=prompt, max_tokens=512)

CLI

mlx_lm.generate \
  --model deadbydawn101/gemma-4-E4B-mlx-4bit \
  --prompt "Write a Python function to find all primes below N." \
  --max-tokens 512

OpenAI-compatible server

mlx_lm.server --model deadbydawn101/gemma-4-E4B-mlx-4bit --port 8080
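The server speaks the OpenAI chat-completions wire format, so any HTTP client works. Here is a minimal stdlib-only sketch: the endpoint path and payload shape follow the OpenAI spec, and the function returns an error dict instead of crashing when no server is running on that port.

```python
import json
from urllib import request, error

def chat(base_url: str, model: str, user_msg: str) -> dict:
    """POST one chat turn to an OpenAI-compatible /v1/chat/completions endpoint."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "max_tokens": 256,
    }).encode()
    req = request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    try:
        with request.urlopen(req, timeout=30) as resp:
            return json.load(resp)
    except (error.URLError, OSError) as exc:
        # No server listening (or wrong port): surface the error to the caller.
        return {"error": str(exc)}

reply = chat("http://localhost:8080", "deadbydawn101/gemma-4-E4B-mlx-4bit", "Hello!")
print("choices" in reply or "error" in reply)
```

On success, the generated text is at `reply["choices"][0]["message"]["content"]`, same as any OpenAI-compatible backend.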

Ollama

ollama run hf.co/deadbydawn101/gemma-4-E4B-mlx-4bit

🧠 Opus Reasoning + Claude Code LoRA

Fine-tune behavior with the Opus Reasoning + Claude Code LoRA — trained on Claude Opus 4.6 reasoning traces and real Claude Code tool-use patterns.

What it teaches the model

| Behavior | Source |
|---|---|
| `<think>`-tag chain-of-thought before every answer | Opus 4.6 reasoning traces |
| Multi-step problem decomposition | Crownelius/Opus-4.6-Reasoning-2100x-formatted |
| Tool call patterns (read/write/bash/search loops) | 140 Claude Code session files |
| Structured completion style | SFT on completions only (not memorization) |
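Because the adapter emits its chain-of-thought inside literal `<think>...</think>` tags, downstream code usually wants to strip the reasoning and keep the final answer. A minimal sketch, assuming exactly that tag convention (the helper name is ours, not part of any library):

```python
import re

def split_reasoning(response: str):
    """Separate <think>...</think> reasoning from the final answer.

    Assumes the adapter wraps its chain-of-thought in literal
    <think> tags; returns (None, response) if no tags are present.
    """
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if match is None:
        return None, response.strip()
    reasoning = match.group(1).strip()
    answer = response[match.end():].strip()
    return reasoning, answer

reasoning, answer = split_reasoning(
    "<think>2+2 is basic arithmetic.</think>The answer is 4."
)
print(reasoning)  # 2+2 is basic arithmetic.
print(answer)     # The answer is 4.
```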

Training results

Dataset:  2,054 train · 109 val · SFT completions-only
Hardware: Apple M4 Max 128GB · Peak mem: 7.876 GB
Runtime:  ~6 min for 1,000 iterations @ ~190 tok/sec

Iter 10   →  2.277   ← cold start
Iter 20   →  0.097   ← style locked in fast
Iter 50   →  0.00063
Iter 100  →  0.0000398
Iter 200  →  0.0000067  (checkpoint)
Iter 1000 →  ~3.5e-7   (final)

Loss collapsed early and hard — the Opus reasoning patterns transferred cleanly to Gemma 4's hybrid attention architecture.

Apply the LoRA

from mlx_vlm import load, generate

model, processor = load(
    "deadbydawn101/gemma-4-E4B-mlx-4bit",
    adapter_path="deadbydawn101/gemma-4-E4B-opus-reasoning-claude-code-lora",
)

Or use the fused model (no adapter needed)

# LoRA baked directly into weights
mlx_lm.generate --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --prompt "..."

gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit


⚡ TurboQuant-MLX — 4.6x KV Cache Compression

TurboQuant-MLX is a RavenX AI project that compresses the KV cache using PolarQuant (rotation-based quantization) + QJL (1-bit residual correction) — enabling dramatically longer contexts at the same memory budget.

| | Without TurboQuant | With TurboQuant |
|---|---|---|
| Context @ same RAM | 8K | 36K |
| KV cache growth | Linear | Compressed |
| Accuracy impact | — | Near-zero |

from turboquant_mlx.mlx_kvcache import TurboQuantKVCache
import mlx_lm.models.cache as cache_module

# Drop-in patch — one line before loading
cache_module.make_prompt_cache = lambda model, **kw: [
    TurboQuantKVCache() for _ in range(len(model.layers))
]

from mlx_vlm import load, generate
model, processor = load("deadbydawn101/gemma-4-E4B-mlx-4bit")
# Context is now compressed automatically — run it as normal

TurboQuant-MLX on GitHub · Release v2.0


🏗 Architecture

Gemma 4 uses hybrid sliding/full attention:

  • 35× sliding attention (window=512) — O(n) local context, fast
  • 7× full attention — global coherence at regular intervals

This gives near-linear memory scaling for long sequences while maintaining full document coherence — ideal for the TurboQuant + long-context use case.
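The memory win is easy to check back-of-envelope from the model details above: sliding-attention layers cap their KV cache at the 512-token window, so only the 7 full-attention layers grow with sequence length. A rough sketch in bf16 (2 bytes per value), assuming head_dim = hidden_size / attention_heads = 2560 / 8 = 320, which is our inference, not a documented figure:

```python
LAYERS_SLIDING, LAYERS_FULL = 35, 7   # per the Model Details table
WINDOW = 512                          # sliding attention window
KV_HEADS = 2
HEAD_DIM = 2560 // 8                  # assumption: hidden_size / attention_heads
BYTES = 2                             # bf16

def kv_bytes(n_tokens: int) -> int:
    # Per token per layer: K and V, each kv_heads x head_dim values.
    per_tok = 2 * KV_HEADS * HEAD_DIM * BYTES
    full = LAYERS_FULL * n_tokens * per_tok            # grows linearly
    sliding = LAYERS_SLIDING * min(n_tokens, WINDOW) * per_tok  # capped
    return full + sliding

for n in (8_192, 131_072):
    print(f"{n:>7} tokens -> {kv_bytes(n) / 2**20:.0f} MiB KV cache")
```

Under these assumptions, a 16x longer context (8K to 131K) grows the KV cache only ~12x, because the sliding layers' share stays constant; TurboQuant then compresses what remains.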


💻 Gemini CLI — Coding Agent + Tool Orchestration

We use RavenX AI's Gemini CLI fork as the coding agent and tool orchestration layer on top of these models. This is what makes the tool-calling capability real in production.

Gemini CLI gives you a full agentic loop in the terminal — Google Search grounding, file read/write, shell execution, web fetching, and MCP server support — all wired to a 1M token context window.

# Install
npm install -g @google/gemini-cli

# Run as a coding agent against this model (via local mlx_lm server)
mlx_lm.server --model deadbydawn101/gemma-4-E4B-mlx-4bit --port 8080 &
gemini --baseUrl http://localhost:8080

# Or use directly against the Gemini API (free tier: 60 req/min · 1,000 req/day)
gemini

What Gemini CLI + these models unlock together

| Capability | How |
|---|---|
| Code generation | Gemini CLI reads your codebase, the model reasons with `<think>` tags |
| Tool calling | Native `<\|tool>` tokens → Gemini CLI executes shell/file/web tools |
| Long context | 1M ctx in the CLI + TurboQuant 4.6x KV compression = very long sessions |
| MCP servers | Connect any MCP server — databases, APIs, custom tools |
| Search grounding | Google Search built in — the model gets live data |

# Real example: code review with tool calling enabled
gemini --baseUrl http://localhost:8080 \
  "Review all Python files in ./src, find potential bugs, and suggest fixes"

# Gemini CLI will: read files → call tools → model reasons → produce structured output

DeadByDawn101/gemini-cli on GitHub — Apache 2.0, free tier, MCP-compatible

🛠️ Tool Calling (Function Calling)

Gemma 4 ships with native tool calling built into its chat template, using the <|tool>, <|tool_call>, and <|tool_response> special tokens — a capability many open models on HuggingFace still lack.

Define tools and call them

from mlx_lm import load, generate
import json

model, tokenizer = load("deadbydawn101/gemma-4-E4B-mlx-4bit")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City and country"},
                    "units": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["location"]
            }
        }
    }
]

messages = [{"role": "user", "content": "What's the weather in San Jose, CA?"}]
prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    add_generation_prompt=True,
    tokenize=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
# Model responds with a structured tool_call in <|tool_call>...<tool_call|> format

Parse tool calls and feed results back

# After tool execution, feed the result back
messages += [
    {"role": "assistant", "tool_calls": [{"function": {"name": "get_weather", "arguments": {"location": "San Jose, CA"}}}]},
    {"role": "tool", "tool_responses": [{"name": "get_weather", "response": {"temp": 72, "condition": "sunny"}}]}
]
prompt = tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, tokenize=False)
final = generate(model, tokenizer, prompt=prompt, max_tokens=256)
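Between the two steps above, you need to extract the function name and JSON arguments from the model's raw output before executing anything. A hedged sketch, assuming the `<|tool_call>call:name{args}<tool_call|>` framing listed in the token format table in this card (the exact framing may vary by template version, so malformed calls are skipped rather than raised):

```python
import json
import re

# Matches the assumed native framing: <|tool_call>call:name{json args}<tool_call|>
TOOL_CALL_RE = re.compile(
    r"<\|tool_call>call:(\w+)(\{.*?\})<tool_call\|>", re.DOTALL
)

def parse_tool_calls(text: str):
    """Extract {"name", "arguments"} dicts from raw model output."""
    calls = []
    for name, args in TOOL_CALL_RE.findall(text):
        try:
            calls.append({"name": name, "arguments": json.loads(args)})
        except json.JSONDecodeError:
            pass  # malformed arguments: skip rather than crash the agent loop
    return calls

sample = '<|tool_call>call:get_weather{"location": "San Jose, CA"}<tool_call|>'
print(parse_tool_calls(sample))
# [{'name': 'get_weather', 'arguments': {'location': 'San Jose, CA'}}]
```

Feed each parsed call to your executor, then append the result as a `"tool"` message exactly as shown above.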

With mlx_vlm (multimodal + tools)

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("deadbydawn101/gemma-4-E4B-mlx-4bit")
prompt = apply_chat_template(
    processor, model.config, messages,
    tools=tools, add_generation_prompt=True
)

Tool token format (native)

| Token | Purpose |
|---|---|
| `<\|tool>...<tool\|>` | Tool definition block |
| `<\|tool_call>call:name{args}<tool_call\|>` | Model calls a tool |
| `<\|tool_response>...<tool_response\|>` | Result returned to the model |

🦙 Ollama — One-Command Setup

Instant run (no install needed)

ollama run hf.co/deadbydawn101/gemma-4-E4B-mlx-4bit

With a custom system prompt + tool support

Create a Modelfile:

FROM hf.co/deadbydawn101/gemma-4-E4B-mlx-4bit

SYSTEM "You are a helpful assistant with tool-use capabilities. Think through problems step by step."

PARAMETER temperature 0.7
PARAMETER num_ctx 8192

Then build and run it:

ollama create ravenx-gemma4 -f Modelfile
ollama run ravenx-gemma4

OpenAI-compatible endpoint

# Ollama exposes an OpenAI-compatible API automatically
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hf.co/deadbydawn101/gemma-4-E4B-mlx-4bit",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Run with mlx_lm server (native, faster on Apple Silicon)

# mlx_lm server is faster than Ollama for Apple Silicon — uses Metal GPU directly
mlx_lm.server --model deadbydawn101/gemma-4-E4B-mlx-4bit --port 8080

# Then use any OpenAI client
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deadbydawn101/gemma-4-E4B-mlx-4bit", "messages": [{"role": "user", "content": "Hello!"}]}'

🔧 Conversion Details

| Step | Detail |
|---|---|
| Source | google/gemma-4-E4B-it (bfloat16, ~17 GB) |
| Tool | `mlx_vlm.convert --q-bits 4 --q-group-size 64 --q-mode affine` |
| Platform | Apple M4 Max 128GB |
| Output | 4.86 GB · ~4.8 bits/weight · 3 shards |
| LoRA training | `mlx_vlm.lora` SFT · rank=8 · alpha=16 · 1k iters |
| LoRA fusion | `mlx_lm fuse` — baked into ravenx-opus variant |

📦 Full RavenX Model Collection

| Model | Size | Description |
|---|---|---|
| gemma-4-E4B-mlx-4bit | 4.86 GB | This model — clean 4-bit E4B base |
| gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit | ~4.86 GB | Fused: base + Opus reasoning LoRA baked in |
| gemma-4-E4B-opus-reasoning-claude-code-lora | 658 MB | LoRA adapter only |
| gemma-4-E2B-Heretic-Uncensored-mlx-4bit | 3.34 GB | 2B abliterated (uncensored) |
| gemma-4-21b-REAP-Tool-Calling-mlx-4bit | 12 GB | 21B REAP-pruned MoE |

License

Gemma Terms of Use — free for research and commercial use with attribution.


Built with 🖤 by RavenX AI · TurboQuant-MLX · Gemini CLI

TriAttention KV Compression

[2026-04-09] Our MLX port was merged into TriAttention (MIT + NVIDIA) — PR #1 by @DeadByDawn101 (RavenX AI).

Apply 10.7x KV memory reduction and 2.5x throughput on top of this model's built-in 4-bit TurboQuant quantization for ~50x combined compression vs full fp16:

from mlx_lm import load
from triattention.mlx import apply_triattention_mlx

model, tokenizer = load("deadbydawn101/gemma-4-E4B-mlx-4bit")
apply_triattention_mlx(model, kv_budget=2048)

RavenX Inference Harness

One-command inference, benchmarking, and local OpenAI-compatible server:

git clone https://github.com/DeadByDawn101/ravenx-inference-harness
cd ravenx-inference-harness

# Inference
python run.py --model deadbydawn101/gemma-4-E4B-mlx-4bit --prompt "Your prompt"

# TriAttention compressed
python run.py --model deadbydawn101/gemma-4-E4B-mlx-4bit --triattention --kv-budget 2048

# Local OpenAI-compatible server (works with OpenClaw)
python serve.py --model deadbydawn101/gemma-4-E4B-mlx-4bit --triattention