Gemma 4 21B REAP — Tool Calling ✅ | 103 Experts | MLX 4-bit | Apple Silicon

The biggest open Gemma 4 on Apple Silicon. 21B MoE, REAP-pruned to 4 active experts, native tool calling, 12 GB MLX 4-bit. TurboQuant ready.

0xSero/gemma-4-21b-a4b-it-REAP converted to MLX 4-bit (affine, group_size=64) for native Apple Silicon inference.

This is Gemma 4 27B MoE pruned with REAP (Router-weighted Expert Activation Pruning) so that only 4 of its 103 experts activate per token — a model with ~21B total parameters that computes only a fraction of them per forward pass, combining large capacity with fast inference.

🖤 12 GB MLX 4-bit — runs on any M-series Mac with 24GB+ unified memory.
Multimodal: text + vision. 131K context window.

What is REAP?

REAP (Router-weighted Expert Activation Pruning) is a technique from Cerebras that prunes MoE experts based on their routing statistics: it identifies which experts the router actually relies on and prunes the rest, resulting in:

  • Fewer experts activated per token (4 active out of 103 total)
  • Faster inference due to reduced compute per forward pass
  • Minimal quality loss — BoolQ accuracy 76%, HellaSwag 46% (see evals below)
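
As a toy illustration of the pruning criterion (a simplified stand-in, not the Cerebras implementation), experts can be ranked by their total router-weighted activation mass over a calibration set and the lowest-scoring ones dropped:

```python
# Toy sketch of REAP-style expert saliency scoring (illustrative only):
# rank experts by total routing mass over calibration tokens, keep the top-k.

def prune_experts(router_probs, keep):
    """router_probs: per-token router softmax outputs (list of lists).
    Returns indices of the `keep` experts with the largest routing mass."""
    num_experts = len(router_probs[0])
    saliency = [sum(tok[e] for tok in router_probs) for e in range(num_experts)]
    top = sorted(range(num_experts), key=lambda e: saliency[e], reverse=True)[:keep]
    return sorted(top)

# Three calibration tokens routed over five experts; experts 0 and 3 dominate.
probs = [
    [0.70, 0.10, 0.05, 0.10, 0.05],
    [0.60, 0.05, 0.05, 0.25, 0.05],
    [0.75, 0.05, 0.05, 0.10, 0.05],
]
print(prune_experts(probs, keep=2))  # → [0, 3]
```

The real method scores experts on activation statistics gathered from calibration data; the principle — low routing mass means a safely removable expert — is the same.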

Model Details

| Property | Value |
|---|---|
| Base model | 0xSero/gemma-4-21b-a4b-it-REAP |
| Original base | google/gemma-4-27b-it (MoE) |
| Architecture | Gemma4ForConditionalGeneration (MoE) |
| Total parameters | ~21B |
| Total experts | 103 |
| Active experts/token | 4 (REAP-pruned) |
| Modalities | Text · Vision |
| Quantization | 4-bit affine, group_size=64, ~4.8 bits/weight |
| File size | 12 GB (down from ~40 GB bf16) |
| Context window | 131,072 tokens |
| Vocab size | 262,144 |
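
The file size follows directly from the parameter count and the effective bits/weight above; a quick sanity check (plain arithmetic, no external data):

```python
# Back-of-the-envelope size check: 21B params at ~4.8 effective bits/weight
# (4-bit weights plus per-group scale/bias overhead at group_size=64).
total_params = 21e9
bits_per_weight = 4.8
size_gb = total_params * bits_per_weight / 8 / 1e9
print(f"{size_gb:.1f} GB")  # → 12.6 GB, consistent with the ~12 GB file size
```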

Evaluation (from source model)

| Benchmark | Score |
|---|---|
| BoolQ | 76% |
| HellaSwag | 46% |
| ARC-Challenge | 28% |

Performance (Apple Silicon)

| Chip | RAM | Tok/sec (est.) |
|---|---|---|
| M4 Max | 128GB | ~20–30 tok/s |
| M3 Ultra | 192GB | ~25–35 tok/s |
| M2 Ultra | 192GB | ~18–25 tok/s |

Requires at least 24GB unified memory. 32GB+ recommended for comfortable operation.

Quickstart

Install

```bash
pip install mlx-lm mlx-vlm
```

Text generation

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit")

messages = [{"role": "user", "content": "Explain mixture-of-experts models simply."}]
prompt = apply_chat_template(processor, model.config, messages)
response = generate(model, processor, prompt=prompt, max_tokens=512, verbose=True)
```

Vision (image + text)

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "https://example.com/photo.jpg"},
        {"type": "text", "text": "Describe this image."}
    ]
}]
prompt = apply_chat_template(processor, model.config, messages, add_generation_prompt=True)
response = generate(model, processor, prompt=prompt, max_tokens=512)
```

CLI

```bash
mlx_vlm.generate \
  --model deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit \
  --prompt "What are the key differences between MoE and dense transformer models?" \
  --max-tokens 512
```

⚡ TurboQuant-MLX — 4.6x KV Cache Compression

Pair this model with TurboQuant-MLX — RavenX AI's Apple Silicon KV cache compression. Run 4.6x longer contexts with near-zero accuracy loss by compressing the KV cache using PolarQuant + QJL residuals.

```python
from turboquant_mlx.mlx_kvcache import TurboQuantKVCache
import mlx_lm.models.cache as cache_module

# Patch mlx-lm to use TurboQuant compression
cache_module.make_prompt_cache = lambda model, **kw: [
    TurboQuantKVCache() for _ in range(len(model.layers))
]

# Now load and run as normal — context is compressed automatically
from mlx_vlm import load, generate
model, processor = load("deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit")
```

| Without TurboQuant | With TurboQuant |
|---|---|
| 8K context @ 12 GB | 36K context @ ~12 GB |
| KV cache grows linearly | KV cache stays compressed |
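
The compression scheme itself isn't reproduced here, but the core idea — storing K/V vectors at low bit-width with a per-vector scale — can be sketched as a naive int4 stand-in (NOT TurboQuant's actual PolarQuant+QJL pipeline):

```python
# Naive int4 KV quantization sketch: storing each key/value vector as
# 4-bit ints plus one scale cuts a float16 cache roughly 4x, which is
# where the extra context headroom comes from.

def quantize_int4(vec):
    """Map floats to integers in [-8, 7] with a per-vector scale."""
    scale = max(abs(v) for v in vec) / 7 or 1.0
    return [max(-8, min(7, round(v / scale))) for v in vec], scale

def dequantize_int4(q, scale):
    return [x * scale for x in q]

key = [0.5, -1.2, 0.03, 0.9]
q, s = quantize_int4(key)
restored = dequantize_int4(q, s)
err = max(abs(a - b) for a, b in zip(key, restored))
print(q, round(err, 3))  # small reconstruction error at ~4x less storage
```

TurboQuant's claimed 4.6x ratio comes from a more sophisticated scheme (polar coordinates plus Johnson-Lindenstrauss residuals) than this plain rounding.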

TurboQuant-MLX on GitHub · Release v2.0

🧠 Opus Reasoning + Claude Code LoRA

Supercharge this model with the Opus Reasoning + Claude Code LoRA — trained on Claude Opus 4.6 reasoning traces and Claude Code tool-use patterns.

Apply it to get structured <think>-tag chain-of-thought reasoning and agentic tool-use behavior:

```python
from mlx_vlm import load, generate

model, processor = load(
    "deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit",
    adapter_path="deadbydawn101/gemma-4-E4B-opus-reasoning-claude-code-lora",
)
```

| What it adds | Detail |
|---|---|
| Reasoning style | `<think>`-tag chain-of-thought before every answer |
| Training data | Claude Opus 4.6 reasoning traces (2,054 examples) |
| Tool-use patterns | 140 Claude Code agentic pattern files |
| Size | 658 MB adapter on top of base model |
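
With the adapter applied, output arrives as a reasoning block followed by the answer. A small helper (hypothetical — not part of the adapter repo) can split the two:

```python
import re

# Hypothetical helper for separating the <think>...</think> reasoning
# block from the final answer in adapter output.

def split_think(text):
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not m:
        return "", text.strip()
    return m.group(1).strip(), text[m.end():].strip()

out = "<think>The user wants a short answer. 2 + 2 = 4.</think>The answer is 4."
reasoning, answer = split_think(out)
print(answer)  # → The answer is 4.
```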

View the adapter repo

💻 Gemini CLI — Coding Agent + Tool Orchestration

We use RavenX AI's Gemini CLI fork as the coding agent and tool orchestration layer on top of these models. This is what makes the tool-calling capability real in production.

Gemini CLI gives you a full agentic loop in the terminal — Google Search grounding, file read/write, shell execution, web fetching, and MCP server support — all wired to a 1M token context window.

```bash
# Install
npm install -g @google/gemini-cli

# Run as a coding agent against this model (via local mlx_lm server)
mlx_lm.server --model deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit --port 8080 &
gemini --baseUrl http://localhost:8080

# Or use directly against Gemini API (free tier: 60 req/min)
gemini
```

What Gemini CLI + these models unlock together

| Capability | How |
|---|---|
| Code generation | Gemini CLI reads your codebase, model reasons with `<think>` tags |
| Tool calling | Native `<\|tool>` tokens → Gemini CLI executes shell/file/web tools |
| Long context | 1M ctx in CLI + TurboQuant 4.6x KV compression = very long sessions |
| MCP servers | Connect any MCP server — databases, APIs, custom tools |
| Search grounding | Google Search built in — model gets live data |

```bash
# Real example: code review with tool calling enabled
gemini --baseUrl http://localhost:8080 \
  "Review all Python files in ./src, find potential bugs, and suggest fixes"

# Gemini CLI will: read files → call tools → model reasons → produce structured output
```

DeadByDawn101/gemini-cli on GitHub — Apache 2.0, free tier, MCP-compatible

🛠️ Tool Calling (Function Calling)

Gemma 4 has native tool calling built into its chat template, implemented with `<|tool>`, `<|tool_call>`, and `<|tool_response>` special tokens — a capability many open models still lack.

Define tools and call them

```python
from mlx_lm import load, generate
import json

model, tokenizer = load("deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City and country"},
                    "units": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["location"]
            }
        }
    }
]

messages = [{"role": "user", "content": "What's the weather in San Jose, CA?"}]
prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    add_generation_prompt=True,
    tokenize=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
# Model responds with a structured tool_call in <|tool_call>...<tool_call|> format
```

Parse tool calls and feed results back

```python
# After tool execution, feed the result back
messages += [
    {"role": "assistant", "tool_calls": [{"function": {"name": "get_weather", "arguments": {"location": "San Jose, CA"}}}]},
    {"role": "tool", "tool_responses": [{"name": "get_weather", "response": {"temp": 72, "condition": "sunny"}}]}
]
prompt = tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, tokenize=False)
final = generate(model, tokenizer, prompt=prompt, max_tokens=256)
```
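
Between those two steps you need to extract the call from the raw generation. A hypothetical parser for the `<|tool_call>call:name{args}<tool_call|>` format this card describes (adjust the pattern if your template output differs):

```python
import json
import re

# Hypothetical parser for the <|tool_call>call:name{json-args}<tool_call|>
# format described in this card; not part of mlx-lm itself.

def parse_tool_call(text):
    m = re.search(r"<\|tool_call>call:(\w+)(\{.*\})<tool_call\|>", text, flags=re.DOTALL)
    if not m:
        return None
    return {"name": m.group(1), "arguments": json.loads(m.group(2))}

raw = '<|tool_call>call:get_weather{"location": "San Jose, CA"}<tool_call|>'
call = parse_tool_call(raw)
print(call["name"], call["arguments"]["location"])  # → get_weather San Jose, CA
```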

With mlx_vlm (multimodal + tools)

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit")

# `messages` and `tools` as defined in the examples above
prompt = apply_chat_template(
    processor, model.config, messages,
    tools=tools, add_generation_prompt=True
)
```

Tool token format (native)

| Token | Purpose |
|---|---|
| `<\|tool>...<tool\|>` | Tool definition block |
| `<\|tool_call>call:name{args}<tool_call\|>` | Model calls a tool |
| `<\|tool_response>...<tool_response\|>` | Result returned to model |
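
When driving the loop by hand, the tool result has to be wrapped before it goes back into the prompt. A hypothetical helper mirroring the token scheme above:

```python
import json

# Hypothetical helper (not a library function): wrap a tool result so it
# can be appended to the prompt as a <|tool_response> block.

def format_tool_response(name, result):
    payload = json.dumps({"name": name, "response": result})
    return f"<|tool_response>{payload}<tool_response|>"

print(format_tool_response("get_weather", {"temp": 72, "condition": "sunny"}))
```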

🦙 Ollama — One-Command Setup

Instant run (no install needed)

```bash
ollama run hf.co/deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit
```

With a custom system prompt + tool support

Create a Modelfile:

```
FROM hf.co/deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit

SYSTEM "You are a helpful assistant with tool-use capabilities. Think through problems step by step."

PARAMETER temperature 0.7
PARAMETER num_ctx 8192
```

Then build and run:

```bash
ollama create ravenx-gemma4 -f Modelfile
ollama run ravenx-gemma4
```

OpenAI-compatible endpoint

```bash
# Ollama exposes an OpenAI-compatible API automatically
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hf.co/deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

Run with mlx_lm server (native, faster on Apple Silicon)

```bash
# mlx_lm server is faster than Ollama on Apple Silicon — uses Metal GPU directly
mlx_lm.server --model deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit --port 8080

# Then use any OpenAI client
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit", "messages": [{"role": "user", "content": "Hello!"}]}'
```

Conversion Details

  • Source: 0xSero/gemma-4-21b-a4b-it-REAP (bfloat16, ~40 GB)
  • Tool: mlx_vlm.convert with --q-bits 4 --q-group-size 64 --q-mode affine
  • Result: ~4.8 bits/weight average, 12 GB output
  • Platform: Apple M4 Max 128GB

Related Models

| Model | Size | Description |
|---|---|---|
| deadbydawn101/gemma-4-E4B-mlx-4bit | 4.86 GB | Standard 4B dense MLX |
| deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit | 3.34 GB | 2B abliterated MLX |
| 0xSero/gemma-4-21b-a4b-it-REAP | 40 GB | Source bf16 |

License

Gemma Terms of Use — free for research and commercial use with attribution.


Converted by deadbydawn101 · RavenX AI
