Gemma 4 21B REAP — Tool Calling ✅ | 103 Experts | MLX 4-bit | Apple Silicon
The biggest open Gemma 4 on Apple Silicon. 21B MoE, REAP-pruned to 4 active experts, native tool calling, 12 GB MLX 4-bit. TurboQuant ready.
0xSero/gemma-4-21b-a4b-it-REAP converted to MLX 4-bit (affine, group_size=64) for native Apple Silicon inference.
This is Gemma 4 27B MoE compressed with REAP (Router-weighted Expert Activation Pruning): the pruned checkpoint retains 103 experts and activates only 4 per token, yielding a model with ~21B total params that computes only a fraction of them per forward pass, combining large capacity with fast inference.
🖤 12 GB MLX 4-bit — runs on any M-series Mac with 24GB+ unified memory.
Multimodal: text + vision. 131K context window.
What is REAP?
REAP (Router-weighted Expert Activation Pruning) is a one-shot MoE compression technique from Cerebras that scores experts by their router-weighted activation statistics and removes the least-contributing ones outright, resulting in:
- Sparse activation: only 4 of the 103 retained experts run per token
- Faster inference due to reduced compute per forward pass
- Minimal quality loss — BoolQ accuracy 76%, HellaSwag 46% (see evals below)
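The sparse-activation idea behind these numbers can be sketched with a toy top-k router. This is illustrative NumPy only; the real Gemma 4 router, expert shapes, and gating details are not reproduced here:

```python
import numpy as np

NUM_EXPERTS, TOP_K, DIM = 103, 4, 8  # toy dimensions; DIM is not the real hidden size

rng = np.random.default_rng(0)
router_w = rng.normal(size=(DIM, NUM_EXPERTS))      # router projection
experts = rng.normal(size=(NUM_EXPERTS, DIM, DIM))  # toy per-expert weights

def moe_forward(x):
    """Route one token through only TOP_K of NUM_EXPERTS experts."""
    logits = x @ router_w
    top = np.argsort(logits)[-TOP_K:]               # indices of the top-4 experts
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                            # renormalized gate weights
    # Only TOP_K expert matmuls are computed, never all 103
    y = sum(p * (x @ experts[i]) for p, i in zip(probs, top))
    return y, top

x = rng.normal(size=DIM)
y, active = moe_forward(x)
print(len(active))  # 4 experts active out of 103
```

Compute per token scales with `TOP_K`, not `NUM_EXPERTS`, which is why the pruned model stays fast despite its total parameter count.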
Model Details
| Property | Value |
|---|---|
| Base model | 0xSero/gemma-4-21b-a4b-it-REAP |
| Original base | google/gemma-4-27b-it (MoE) |
| Architecture | Gemma4ForConditionalGeneration (MoE) |
| Total parameters | ~21B |
| Total experts | 103 |
| Active experts/token | 4 (REAP-pruned) |
| Modalities | Text · Vision |
| Quantization | 4-bit affine, group_size=64, ~4.8 bits/weight |
| File size | 12 GB (down from ~40 GB bf16) |
| Context window | 131,072 tokens |
| Vocab size | 262,144 |
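As a sanity check on the table, the file size follows directly from the average bit width: ~21B weights at ~4.8 bits/weight comes to about 12.6 GB, in line with the reported 12 GB (the exact number depends on which layers are left unquantized).

```python
# Back-of-envelope: total params x bits/weight -> bytes on disk
params = 21e9
bits_per_weight = 4.8
size_gb = params * bits_per_weight / 8 / 1e9
print(round(size_gb, 1))  # 12.6
```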
Evaluation (from source model)
| Benchmark | Score |
|---|---|
| BoolQ | 76% |
| HellaSwag | 46% |
| ARC-Challenge | 28% |
Performance (Apple Silicon)
| Chip | RAM | Tok/sec (est) |
|---|---|---|
| M4 Max 128GB | 128GB | ~20–30 tok/s |
| M3 Ultra 192GB | 192GB | ~25–35 tok/s |
| M2 Ultra 192GB | 192GB | ~18–25 tok/s |
Requires at least 24GB unified memory. 32GB+ recommended for comfortable operation.
Quickstart
Install
pip install mlx-lm mlx-vlm
Text generation
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
model, processor = load("deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit")
messages = [{"role": "user", "content": "Explain mixture-of-experts models simply."}]
prompt = apply_chat_template(processor, model.config, messages)
response = generate(model, processor, prompt=prompt, max_tokens=512, verbose=True)
Vision (image + text)
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
model, processor = load("deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit")
messages = [{
"role": "user",
"content": [
{"type": "image", "image": "https://example.com/photo.jpg"},
{"type": "text", "text": "Describe this image."}
]
}]
prompt = apply_chat_template(processor, model.config, messages, add_generation_prompt=True)
response = generate(model, processor, prompt=prompt, max_tokens=512)
CLI
mlx_vlm.generate \
--model deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit \
--prompt "What are the key differences between MoE and dense transformer models?" \
--max-tokens 512
⚡ TurboQuant-MLX — 4.6x KV Cache Compression
Pair this model with TurboQuant-MLX — RavenX AI's Apple Silicon KV cache compression. Run 4.6x longer contexts with near-zero accuracy loss by compressing the KV cache using PolarQuant + QJL residuals.
from turboquant_mlx.mlx_kvcache import TurboQuantKVCache
import mlx_lm.models.cache as cache_module
# Patch mlx-lm to use TurboQuant compression
cache_module.make_prompt_cache = lambda model, **kw: [
TurboQuantKVCache() for _ in range(len(model.layers))
]
# Now load and run as normal — context is compressed automatically
from mlx_vlm import load, generate
model, processor = load("deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit")
| Without TurboQuant | With TurboQuant |
|---|---|
| 8K context @ 12 GB | 36K context @ ~12 GB |
| KV cache grows linearly | KV cache stays compressed |
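The context figures in the table are consistent with the claimed 4.6x ratio. A quick check, assuming the same memory budget is spent on the KV cache either way:

```python
# Context that fits in a fixed KV-cache budget scales with the compression ratio
base_ctx = 8_192       # tokens that fit uncompressed
ratio = 4.6            # TurboQuant's claimed compression factor
compressed_ctx = int(base_ctx * ratio)
print(compressed_ctx)  # 37683, i.e. roughly the table's 36K once overhead is accounted for
```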
→ TurboQuant-MLX on GitHub · Release v2.0
🧠 Opus Reasoning + Claude Code LoRA
Supercharge this model with the Opus Reasoning + Claude Code LoRA — trained on Claude Opus 4.6 reasoning traces and Claude Code tool-use patterns.
Apply it to get structured <think>-tag chain-of-thought reasoning and agentic tool-use behavior:
from mlx_vlm import load, generate
model, processor = load(
"deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit",
adapter_path="deadbydawn101/gemma-4-E4B-opus-reasoning-claude-code-lora",
)
| What it adds | Detail |
|---|---|
| Reasoning style | <think> tag chain-of-thought before every answer |
| Training data | Claude Opus 4.6 reasoning traces (2,054 examples) |
| Tool-use patterns | 140 Claude Code agentic pattern files |
| Size | 658 MB adapter on top of base model |
💻 Gemini CLI — Coding Agent + Tool Orchestration
We use RavenX AI's Gemini CLI fork as the coding agent and tool orchestration layer on top of these models. This is what makes the tool-calling capability real in production.
Gemini CLI gives you a full agentic loop in the terminal — Google Search grounding, file read/write, shell execution, web fetching, and MCP server support — all wired to a 1M token context window.
# Install
npm install -g @google/gemini-cli
# Run as a coding agent against this model (via local mlx_lm server)
mlx_lm.server --model deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit --port 8080 &
gemini --baseUrl http://localhost:8080
# Or use directly against Gemini API (free tier: 60 req/min)
gemini
What Gemini CLI + these models unlock together
| Capability | How |
|---|---|
| Code generation | Gemini CLI reads your codebase, model reasons with <think> tags |
| Tool calling | Native <|tool> tokens → Gemini CLI executes shell/file/web tools |
| Long context | 1M ctx in CLI + TurboQuant 4.6x KV compression = very long sessions |
| MCP servers | Connect any MCP server — databases, APIs, custom tools |
| Search grounding | Google Search built in — model gets live data |
# Real example: code review with tool calling enabled
gemini --baseUrl http://localhost:8080 \
"Review all Python files in ./src, find potential bugs, and suggest fixes"
# Gemini CLI will: read files → call tools → model reasons → produce structured output
→ DeadByDawn101/gemini-cli on GitHub — Apache 2.0, free tier, MCP-compatible
🛠️ Tool Calling (Function Calling)
Gemma 4 has native tool calling built into its chat template, using <|tool>, <|tool_call>, and <|tool_response> special tokens. Many open models only approximate tool calling through prompt conventions; here it is part of the vocabulary and template.
Define tools and call them
from mlx_lm import load, generate
import json
model, tokenizer = load("deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit")
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City and country"},
"units": {"type": "string", "enum": ["celsius", "fahrenheit"]}
},
"required": ["location"]
}
}
}
]
messages = [{"role": "user", "content": "What's the weather in San Jose, CA?"}]
prompt = tokenizer.apply_chat_template(
messages,
tools=tools,
add_generation_prompt=True,
tokenize=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
# Model responds with a structured tool_call in <|tool_call>...<tool_call|> format
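Before executing anything, you need to pull the tool name and arguments out of the generated text. A minimal parser for the `<|tool_call>call:name{args}<tool_call|>` format documented in the token table below (`parse_tool_call` is a helper written for this card, not an mlx_lm API; it assumes the arguments are valid JSON):

```python
import json
import re

# Matches <|tool_call>call:name{...json args...}<tool_call|>
TOOL_CALL_RE = re.compile(r"<\|tool_call>call:(\w+)(\{.*\})<tool_call\|>", re.DOTALL)

def parse_tool_call(text):
    """Extract (name, args_dict) from a <|tool_call> block, or None if absent."""
    m = TOOL_CALL_RE.search(text)
    if m is None:
        return None
    name, raw_args = m.group(1), m.group(2)
    return name, json.loads(raw_args)

out = '<|tool_call>call:get_weather{"location": "San Jose, CA"}<tool_call|>'
print(parse_tool_call(out))  # ('get_weather', {'location': 'San Jose, CA'})
```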
Parse tool calls and feed results back
# After tool execution, feed the result back
messages += [
{"role": "assistant", "tool_calls": [{"function": {"name": "get_weather", "arguments": {"location": "San Jose, CA"}}}]},
{"role": "tool", "tool_responses": [{"name": "get_weather", "response": {"temp": 72, "condition": "sunny"}}]}
]
prompt = tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, tokenize=False)
final = generate(model, tokenizer, prompt=prompt, max_tokens=256)
With mlx_vlm (multimodal + tools)
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
model, processor = load("deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit")
# Reuses the `messages` and `tools` defined in the examples above
prompt = apply_chat_template(
    processor, model.config, messages,
    tools=tools, add_generation_prompt=True
)
response = generate(model, processor, prompt=prompt, max_tokens=256)
Tool token format (native)
| Token | Purpose |
|---|---|
| `<\|tool>...<tool\|>` | Tool definition block |
| `<\|tool_call>call:name{args}<tool_call\|>` | Model calls a tool |
| `<\|tool_response>...<tool_response\|>` | Result returned to model |
🦙 Ollama — One-Command Setup
Instant run (no install needed)
ollama run hf.co/deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit
With a custom system prompt + tool support
Create a Modelfile:
FROM hf.co/deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit
SYSTEM "You are a helpful assistant with tool-use capabilities. Think through problems step by step."
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
ollama create ravenx-gemma4 -f Modelfile
ollama run ravenx-gemma4
OpenAI-compatible endpoint
# Ollama exposes an OpenAI-compatible API automatically
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "hf.co/deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Run with mlx_lm server (native, faster on Apple Silicon)
# mlx_lm server is faster than Ollama for Apple Silicon — uses Metal GPU directly
mlx_lm.server --model deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit --port 8080
# Then use any OpenAI client
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit", "messages": [{"role": "user", "content": "Hello!"}]}'
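The same endpoint can be called from Python with only the standard library. A sketch assuming the mlx_lm server from the command above is already listening on port 8080 (`build_payload` and `chat` are helpers defined here, not part of mlx_lm):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1/chat/completions"
MODEL = "deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit"

def build_payload(user_message, max_tokens=256):
    """OpenAI-style chat payload accepted by the local server."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }

def chat(user_message):
    req = urllib.request.Request(
        BASE_URL,
        data=json.dumps(build_payload(user_message)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# print(chat("Hello!"))  # uncomment once the server is running
```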
Conversion Details
- Source: `0xSero/gemma-4-21b-a4b-it-REAP` (bfloat16, ~40 GB)
- Tool: `mlx_vlm.convert` with `--q-bits 4 --q-group-size 64 --q-mode affine`
- Result: ~4.8 bits/weight average, 12 GB output
- Platform: Apple M4 Max 128GB
Related Models
| Model | Size | Description |
|---|---|---|
| deadbydawn101/gemma-4-E4B-mlx-4bit | 4.86 GB | Standard 4B dense MLX |
| deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit | 3.34 GB | 2B abliterated MLX |
| 0xSero/gemma-4-21b-a4b-it-REAP | 40 GB | Source bf16 |
License
Gemma Terms of Use — free for research and commercial use with attribution.
Converted by deadbydawn101 · RavenX AI