# Gemma 4 19B-A4B-it REAP

A 30% expert-pruned version of google/gemma-4-26b-a4b-it, produced with Cerebras REAP (Router-weighted Expert Activation Pruning).
| | Original | 0.20 variant | This Model (0.30) |
|---|---|---|---|
| Total params | ~26B | 21.34B | 19.02B |
| Experts per layer | 128 | 103 | 90 |
| Active params/tok | ~4B | ~4B | ~4B |
| Experts/tok | 8 | 8 | 8 |
| Format | BF16 | BF16 | BF16 |
| Disk size | ~52 GB | ~43 GB | ~36 GB |
REAP removes 30% of MoE experts (38 of 128 per layer) while preserving the model's routing behavior. The active parameter count per token is unchanged since the router still selects 8 experts per token from the remaining pool. This yields a ~31% reduction in total disk/memory footprint.
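Why the active parameter count is unchanged can be seen from how top-k routing works: the router softmaxes over whichever experts remain and still picks 8, renormalizing their gates. A minimal sketch (illustrative only; the toy "pruning" here just drops the last 38 experts, whereas REAP drops the lowest-saliency ones):

```python
import math
import random

def route(logits, k=8):
    """Softmax over the expert pool, keep top-k, renormalize the gates."""
    exps = [math.exp(l) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    topk = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    mass = sum(probs[i] for i in topk)
    return {i: probs[i] / mass for i in topk}  # renormalized top-k gates

random.seed(0)
full = [random.gauss(0, 1) for _ in range(128)]  # original expert pool
pruned = full[:90]                               # 38 experts removed (illustrative)

# Either way, exactly 8 experts fire per token, so active params/token is fixed
assert len(route(full)) == len(route(pruned)) == 8
assert abs(sum(route(pruned).values()) - 1.0) < 1e-9
```

The savings therefore come entirely from the experts that no longer need to be stored, not from changing per-token compute.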
## How This Model Was Made

### Step 1: Calibration (Activation Observation)
We ran the full Gemma 4 26B-A4B-it model over a curated calibration dataset to record expert activation patterns across all layers.
Calibration dataset: 22,000 samples drawn from the sources below:
| Category | Samples | Source Dataset |
|---|---|---|
| Coding (general) | 1,000 | theblackcat102/evol-codealpaca-v1 |
| Coding (additional) | 1,636 | theblackcat102/evol-codealpaca-v1 |
| Reasoning -- code | 3,480 | open-r1/Mixture-of-Thoughts[code] |
| Reasoning -- math | 3,578 | open-r1/Mixture-of-Thoughts[math] |
| Reasoning -- science | 3,576 | open-r1/Mixture-of-Thoughts[science] |
| Tool calling | 1,000 | Salesforce/xlam-function-calling-60k |
| Agentic coding | 1,000 | SWE-bench/SWE-smith-trajectories |
| Biomedical QA | 800 | qiaojin/PubMedQA[pqa_labeled] |
| Science QA | 800 | derek-thomas/ScienceQA |
| Grade-school math | 4,466 | openai/gsm8k[main] |
| Competition math | 500 | HuggingFaceH4/MATH-500 |
| Code correctness | 164 | evalplus/humanevalplus |
| **Total** | **22,000** | |
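Assembling such a mix is just weighted sampling per source followed by a shuffle. A hypothetical sketch (the `build_mix` helper and `mix_spec` names are ours; per-source counts come from the table above, with same-dataset rows merged):

```python
import random

# Per-source sample counts from the calibration table
mix_spec = {
    "evol-codealpaca-v1": 2636,    # 1,000 + 1,636 coding samples
    "Mixture-of-Thoughts": 10634,  # code + math + science reasoning subsets
    "xlam-function-calling-60k": 1000,
    "SWE-smith-trajectories": 1000,
    "PubMedQA": 800,
    "ScienceQA": 800,
    "gsm8k": 4466,
    "MATH-500": 500,
    "humanevalplus": 164,
}
assert sum(mix_spec.values()) == 22000  # matches the card's total

def build_mix(pools, spec, seed=42):
    """Draw a fixed number of samples from each source pool, then shuffle."""
    rng = random.Random(seed)
    out = []
    for name, n in spec.items():
        out.extend(rng.sample(pools[name], n))
    rng.shuffle(out)
    return out
```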
### Step 2: REAP Pruning

The lowest-scoring 30% of experts (38 per layer) are removed, based on a saliency score combining router gate values, expert activation norms, and routing frequency. Router weights are renormalized over the surviving experts.
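The scoring idea can be sketched as a gate-weighted activation norm averaged over the tokens routed to each expert. This is a toy illustration, not the exact REAP criterion (see the paper for the precise formulation):

```python
def reap_saliency(gate_values, output_norms):
    """Toy router-weighted saliency for one expert: mean over tokens
    routed to it of (gate value) * (expert output L2 norm)."""
    if not gate_values:
        return 0.0  # an expert that never fires has zero saliency
    return sum(g * n for g, n in zip(gate_values, output_norms)) / len(gate_values)

# Expert A gets strong gates and large outputs; expert B barely fires
a = reap_saliency([0.4, 0.5, 0.3], [2.0, 1.8, 2.2])
b = reap_saliency([0.05], [0.3])
assert a > b  # under a 0.30 compression ratio, B is pruned long before A
```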
### Pruning Configuration
| Parameter | Value |
|---|---|
| Compression ratio | 0.30 (30% expert removal) |
| Original experts per layer | 128 |
| Remaining experts per layer | 90 |
| Pruning method | REAP |
| Distance measure | Angular (cosine) |
| Router weight renormalization | Yes |
| Seed | 42 |
## Benchmark Results

### Accuracy (generative, 0-shot, 50 samples/task, thinking enabled, vLLM 0.19, 4x RTX 3090)

Evaluated with lm-eval generative tasks using `--apply_chat_template` and `think_end_token=<|channel|>` to correctly handle Gemma 4's thinking mode.
| Task | Original | REAP 0.20 | REAP 0.30 |
|---|---|---|---|
| Elementary Math | 92% | 90% | 88% |
| Philosophy | 92% | 88% | 74% |
| World Religions | 90% | 64% | 48% |
| College CS | 56% | 76% | 68% |
| HS Math | 24%* | 44%* | 48%* |
| Abstract Algebra | 12%* | 28%* | 28%* |
| College Math | 16%* | 18%* | 24%* |
* Tasks with answer-extraction failures; true accuracy is likely higher.
Summary: at 30% pruning, the model stays strong on elementary math (88%) and college CS (68%). Knowledge-intensive tasks see larger drops: roughly 18 points on philosophy and 42 points on world religions, in exchange for a ~31% smaller model.
### Robustness: Generation Quality (14 prompts, temp=0.7, top_p=0.9, max 2048 tokens)
| Domain | N | Orig AvgWords | REAP 0.30 AvgWords | Orig Loop | REAP Loop | Orig Collapse | REAP Collapse |
|---|---|---|---|---|---|---|---|
| Coding | 3 | 670 | 558 | 0% | 0% | 0% | 0% |
| Math reasoning | 3 | 296 | 305 | 0% | 0% | 0% | 0% |
| Philosophy | 3 | 819 | 681 | 0% | 0% | 0% | 0% |
| Long context | 2 | 1210 | 864 | 50% | 0% | 0% | 0% |
| Repetition stress | 3 | 1088 | 1056 | 33% | 33% | 0% | 0% |
No looping or collapse on coding, math, or philosophy prompts, and the REAP 0.30 model actually beats the original on long context (0% vs 50% loop rate). On this small probe set, generation quality appears fully preserved at 30% compression.
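A loop rate like the one tabulated above can be probed with a crude repeated-n-gram check. A minimal sketch (this metric is hypothetical, not necessarily the one used for the table):

```python
def loop_rate(text, n=8):
    """Fraction of n-gram windows that are repeats of an earlier window.
    0.0 means no repeated n-grams; values near 1.0 indicate heavy looping."""
    words = text.split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0  # text shorter than one window
    return 1 - len(set(grams)) / len(grams)

clean = "the quick brown fox jumps over the lazy dog near the riverbank today"
looped = "yes and " * 40  # degenerate repetition

assert loop_rate(clean) == 0.0
assert loop_rate(looped) > 0.5
```

In practice a generation would be flagged as "looping" when this rate crosses some threshold; collapse (e.g. emitting a single token forever) shows up as a rate near 1.0.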
## Architecture
Gemma 4 uses a hybrid sliding/full attention MoE architecture:
- 30 transformer layers
- Sliding attention (window=1024) for 25 layers, full attention every 6th layer
- MoE FFN with 90 remaining experts per layer (originally 128), 8 active per token
- Thinking model -- uses `<|channel|>thought` / `<|channel|>response` channels
- Multimodal -- supports text and vision inputs
- Context window: 262,144 tokens
- Vocab size: 262,144
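The sliding/full interleaving above can be written down as a simple layer schedule, assuming "every 6th layer" means layers 6, 12, 18, 24, and 30 use full attention:

```python
# Hybrid attention layout: 30 layers, full attention every 6th layer,
# sliding-window attention (window=1024) otherwise.
layers = ["full" if (i + 1) % 6 == 0 else "sliding" for i in range(30)]

assert layers.count("full") == 5       # layers 6, 12, 18, 24, 30
assert layers.count("sliding") == 25   # matches the 25 sliding layers above
assert layers[5] == "full"             # layer 6 is the first full-attention layer
```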
## Usage

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "0xSero/gemma-4-19b-a4b-it-REAP"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a quicksort in Python."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=4096)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
### vLLM

```shell
pip install "vllm>=0.19" "transformers>=5.0"

vllm serve 0xSero/gemma-4-19b-a4b-it-REAP \
  --tensor-parallel-size 2 \
  --enforce-eager \
  --gpu-memory-utilization 0.9 \
  --max-model-len 16384 \
  --trust-remote-code
```
## Citation

```bibtex
@inproceedings{lasby2025reap,
  title     = {{REAP} the Experts: Why Pruning Prevails for One-Shot {MoE} Compression},
  author    = {Lasby, Mike and others},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026},
  url       = {https://arxiv.org/abs/2510.13999}
}
```
## Links
- REAP paper: arxiv.org/abs/2510.13999
- REAP code: github.com/cerebras/reap
- 20% pruned variant: 0xSero/gemma-4-21b-a4b-it-REAP
- Base model: google/gemma-4-26b-a4b-it