Gemma 4 21B-A4B-it REAP

20% expert-pruned version of google/gemma-4-26b-a4b-it using Cerebras REAP (Router-weighted Expert Activation Pruning).

| | Original | This Model (0.20) | 0.30 variant |
|---|---|---|---|
| Total params | ~26B | 21.34B | 19.02B |
| Experts per layer | 128 | 103 | 90 |
| Active params/tok | ~4B | ~4B | ~4B |
| Experts/tok | 8 | 8 | 8 |
| Format | BF16 | BF16 | BF16 |
| Disk size | ~52 GB | ~43 GB | ~36 GB |

REAP removes 20% of MoE experts (25 of 128 per layer) while preserving the model's routing behavior. The active parameter count per token is unchanged since the router still selects 8 experts per token from the remaining pool. This yields an ~18% reduction in total disk/memory footprint.
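A minimal sketch of why active parameters stay constant: top-k routing always activates the same number of experts per token, regardless of how large the remaining pool is (toy NumPy code, not the model's actual router).

```python
import numpy as np

rng = np.random.default_rng(0)

def route_top_k(logits, k=8):
    """Select the top-k experts and softmax-renormalize their weights."""
    top = np.argsort(logits)[-k:]                # indices of the k largest logits
    w = np.exp(logits[top] - logits[top].max())  # numerically stable softmax
    return top, w / w.sum()                      # weights sum to 1 over kept experts

# 128 experts before pruning, 103 after: either way the router activates
# exactly 8 experts per token, so active params/token is unchanged.
for n_experts in (128, 103):
    top, weights = route_top_k(rng.normal(size=n_experts), k=8)
    assert len(top) == 8 and abs(weights.sum() - 1.0) < 1e-9
```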

How This Model Was Made

Step 1: Calibration (Activation Observation)

We ran the full Gemma 4 26B-A4B-it model over a curated calibration dataset to record expert activation patterns. The observer hooks capture router gate values, expert activation norms, and routing frequencies for every layer across all calibration tokens.
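The statistics the observer hooks accumulate can be sketched as follows (a toy per-layer accumulator, not the actual hook implementation):

```python
import numpy as np

class ExpertObserver:
    """Toy per-layer observer: accumulates the statistics REAP consumes.

    For each expert we track how often it is selected, the summed router
    gate values it receives, and the summed norms of its outputs, across
    all calibration tokens.
    """
    def __init__(self, n_experts):
        self.count = np.zeros(n_experts)
        self.gate_sum = np.zeros(n_experts)
        self.norm_sum = np.zeros(n_experts)
        self.tokens = 0

    def observe(self, expert_ids, gates, out_norms):
        """Record one token's routing: selected experts, gate values, output norms."""
        self.tokens += 1
        for e, g, n in zip(expert_ids, gates, out_norms):
            self.count[e] += 1
            self.gate_sum[e] += g
            self.norm_sum[e] += n
```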

Calibration dataset: 22,000 samples drawn from 12 sources covering coding, reasoning, math, science, tool-calling, and agentic tasks:

| Category | Samples | Source Dataset |
|---|---|---|
| Coding (general) | 1,000 | theblackcat102/evol-codealpaca-v1 |
| Coding (additional) | 1,636 | theblackcat102/evol-codealpaca-v1 |
| Reasoning -- code | 3,480 | open-r1/Mixture-of-Thoughts[code] |
| Reasoning -- math | 3,578 | open-r1/Mixture-of-Thoughts[math] |
| Reasoning -- science | 3,576 | open-r1/Mixture-of-Thoughts[science] |
| Tool calling | 1,000 | Salesforce/xlam-function-calling-60k |
| Agentic coding | 1,000 | SWE-bench/SWE-smith-trajectories |
| Biomedical QA | 800 | qiaojin/PubMedQA[pqa_labeled] |
| Science QA | 800 | derek-thomas/ScienceQA |
| Grade-school math | 4,466 | openai/gsm8k[main] |
| Competition math | 500 | HuggingFaceH4/MATH-500 |
| Code correctness | 164 | evalplus/humanevalplus |
| **Total** | **22,000** | |

Step 2: REAP Pruning

Using the recorded activation data, REAP scores each expert's importance per layer by combining router gate values, expert activation norms, and frequency-weighted saliency. The lowest-scoring 20% of experts (25 per layer) are removed. Router logits are renormalized post-pruning to maintain the output distribution.
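The scoring-and-pruning step can be sketched like this. The exact saliency criterion is defined in the REAP paper; this toy version combines the three recorded statistics (gates, norms, frequency) in one plausible way.

```python
import numpy as np

def reap_saliency(count, gate_sum, norm_sum, total_tokens):
    """One plausible frequency-weighted score: avg gate * avg norm * routing freq."""
    safe = np.maximum(count, 1)  # avoid division by zero for never-selected experts
    return (gate_sum / safe) * (norm_sum / safe) * (count / total_tokens)

def experts_to_keep(saliency, ratio=0.20):
    """Drop the lowest-scoring `ratio` of experts; return kept indices in order."""
    n_drop = int(len(saliency) * ratio)          # 128 * 0.20 = 25 dropped per layer
    return np.sort(np.argsort(saliency)[n_drop:])
```

With 128 experts and `ratio=0.20`, this keeps exactly the 103 experts per layer reported above.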

Pruning Configuration

| Parameter | Value |
|---|---|
| Compression ratio | 0.20 (20% expert removal) |
| Original experts per layer | 128 |
| Remaining experts per layer | 103 |
| Pruning method | REAP |
| Distance measure | Angular (cosine) |
| Router weight renormalization | Yes |
| Seed | 42 |

Benchmark Results

Accuracy (generative, 0-shot, 50 samples/task, thinking enabled, vLLM 0.19, 4x RTX 3090)

Evaluated using lm-eval generative tasks with --apply_chat_template and think_end_token=<channel|> to properly handle Gemma 4's thinking mode. Scores extracted from model responses using regex matching.
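An illustrative lm-eval invocation for this setup (task list and parallelism are examples; the thinking-token handling described above is configured separately):

```shell
lm_eval --model vllm \
  --model_args pretrained=0xSero/gemma-4-21b-a4b-it-REAP,tensor_parallel_size=4 \
  --tasks gsm8k \
  --num_fewshot 0 \
  --limit 50 \
  --apply_chat_template
```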

| Task | Original | REAP 0.20 | REAP 0.30 |
|---|---|---|---|
| Elementary Math | 92% | 90% | 88% |
| Philosophy | 92% | 88% | 74% |
| World Religions | 90% | 64% | 48% |
| College CS | 56% | 76% | 68% |
| HS Math | 24%* | 44%* | 48%* |
| Abstract Algebra | 12%* | 28%* | 28%* |
| College Math | 16%* | 18%* | 24%* |
| GSM8K | 86% | 84% | -- |

* Tasks with significant extraction failures (model outputs equations rather than single letters). Real accuracy likely higher for all models.

Notes:

  • Gemma 4 is a thinking model -- it reasons internally before answering. Standard loglikelihood-based benchmarks give misleading results because the model emits reasoning tokens before its final answer.
  • GSM8K uses flexible-extract which handles thinking output well.
  • College CS and math tasks show REAP sometimes outperforming the original, likely due to sampling variance at n=50.
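A minimal answer-letter extractor in the spirit of the regex matching described above (the exact patterns used in the eval are an assumption; the `None` return corresponds to the "extraction failure" case flagged for the math-heavy tasks):

```python
import re

def extract_choice(response):
    """Return a single answer letter A-D from a free-form response, or None."""
    text = response.strip()
    for pattern in (r"[Aa]nswer\s*(?:is)?[:\s]*\(?([A-D])\)?",  # "Answer: B" / "answer is (B)"
                    r"\(([A-D])\)",                              # "(C)"
                    r"\b([A-D])\s*$"):                           # trailing lone letter
        m = re.search(pattern, text)
        if m:
            return m.group(1)
    return None  # extraction failure: e.g. the model output an equation instead
```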

Generation Quality: Side-by-Side (14 prompts, temp=0.7, top_p=0.9, max 2048 tokens)

Both the original and REAP 0.20 models were tested on 14 challenging prompts across coding, math, philosophy, long-context, and repetition stress with proper chat template formatting.

| Domain | N | Orig avg words | REAP avg words | Orig loop | REAP loop | Orig collapse | REAP collapse |
|---|---|---|---|---|---|---|---|
| Coding | 3 | 670 | 648 | 0% | 0% | 0% | 0% |
| Math reasoning | 3 | 296 | 261 | 0% | 0% | 0% | 0% |
| Philosophy | 3 | 819 | 727 | 0% | 0% | 0% | 0% |
| Long context | 2 | 1210 | 854 | 50% | 0% | 0% | 0% |
| Repetition stress | 3 | 1088 | 1099 | 33% | 33% | 0% | 0% |

12/14 clean ties, 1 REAP win (long-context), 1 mutual mild loop (sorting algorithms). The REAP 0.20 model is essentially indistinguishable from the original on generation quality.

Architecture

Gemma 4 uses a hybrid sliding/full attention MoE architecture:

  • 30 transformer layers
  • Sliding attention (window=1024) for 25 layers, full attention every 6th layer
  • MoE FFN with 103 remaining experts per layer (originally 128), 8 active per token
  • Thinking model -- uses <|channel>thought / <|channel>response channels
  • Multimodal -- supports text and vision inputs
  • Context window: 262,144 tokens
  • Vocab size: 262,144
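The sliding/full layout above can be sketched as a layer schedule (the 1-indexed placement of the full-attention layers is an assumption; only the 25/5 split is stated above):

```python
def attention_pattern(n_layers=30, full_every=6, window=1024):
    """Full attention on every 6th layer, sliding-window attention elsewhere."""
    return ["full" if (i + 1) % full_every == 0 else f"sliding({window})"
            for i in range(n_layers)]

pattern = attention_pattern()
assert pattern.count("full") == 5              # 30 / 6 full-attention layers
assert pattern.count("sliding(1024)") == 25    # the 25 sliding layers above
```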

Usage

Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "0xSero/gemma-4-21b-a4b-it-REAP"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

messages = [{"role": "user", "content": "Write a quicksort in Python."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

vLLM

```shell
pip install "vllm>=0.19" "transformers>=5.0"

vllm serve 0xSero/gemma-4-21b-a4b-it-REAP \
    --tensor-parallel-size 2 \
    --enforce-eager \
    --gpu-memory-utilization 0.9 \
    --max-model-len 16384 \
    --trust-remote-code
```
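Once the server is up, it can be queried through vLLM's OpenAI-compatible endpoint (default port 8000):

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "0xSero/gemma-4-21b-a4b-it-REAP",
        "messages": [{"role": "user", "content": "Write a quicksort in Python."}],
        "max_tokens": 4096
      }'
```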

Citation

```bibtex
@inproceedings{lasby2025reap,
  title={{REAP} the Experts: Why Pruning Prevails for One-Shot {MoE} Compression},
  author={Lasby, Mike and others},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026},
  url={https://arxiv.org/abs/2510.13999}
}
```
