Gemma 4 21B-A4B-it REAP

20% expert-pruned version of google/gemma-4-26b-a4b-it using Cerebras REAP (Router-weighted Expert Activation Pruning).

| | Original | This Model (0.20) | 0.30 variant |
|---|---|---|---|
| Total params | ~26B | 21.34B | 19.02B |
| Experts per layer | 128 | 103 | 90 |
| Active params/tok | ~4B | ~4B | ~4B |
| Experts/tok | 8 | 8 | 8 |
| Format | BF16 | BF16 | BF16 |
| Disk size | ~52 GB | ~43 GB | ~36 GB |

REAP removes 20% of MoE experts (25 of 128 per layer) while preserving the model's routing behavior. The active parameter count per token is unchanged since the router still selects 8 experts per token from the remaining pool. This yields an ~18% reduction in total disk/memory footprint.
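A minimal sketch of why active parameters stay constant: top-k routing always activates the same number of experts per token, regardless of how large the remaining pool is (toy NumPy code, not the model's actual router).

```python
import numpy as np

rng = np.random.default_rng(0)

def route_top_k(logits, k=8):
    """Select the top-k experts and softmax-renormalize their weights."""
    top = np.argsort(logits)[-k:]                # indices of the k largest logits
    w = np.exp(logits[top] - logits[top].max())  # numerically stable softmax
    return top, w / w.sum()                      # weights sum to 1 over kept experts

# 128 experts before pruning, 103 after: either way the router activates
# exactly 8 experts per token, so active params/token is unchanged.
for n_experts in (128, 103):
    top, weights = route_top_k(rng.normal(size=n_experts), k=8)
    assert len(top) == 8 and abs(weights.sum() - 1.0) < 1e-9
```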

How This Model Was Made

Step 1: Calibration (Activation Observation)

We ran the full Gemma 4 26B-A4B-it model over a curated calibration dataset to record expert activation patterns. The observer hooks capture router gate values, expert activation norms, and routing frequencies for every layer across all calibration tokens.
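The statistics the observer hooks accumulate can be sketched as follows (a toy per-layer accumulator, not the actual hook implementation):

```python
import numpy as np

class ExpertObserver:
    """Toy per-layer observer: accumulates the statistics REAP consumes.

    For each expert we track how often it is selected, the summed router
    gate values it receives, and the summed norms of its outputs, across
    all calibration tokens.
    """
    def __init__(self, n_experts):
        self.count = np.zeros(n_experts)
        self.gate_sum = np.zeros(n_experts)
        self.norm_sum = np.zeros(n_experts)
        self.tokens = 0

    def observe(self, expert_ids, gates, out_norms):
        """Record one token's routing: selected experts, gate values, output norms."""
        self.tokens += 1
        for e, g, n in zip(expert_ids, gates, out_norms):
            self.count[e] += 1
            self.gate_sum[e] += g
            self.norm_sum[e] += n
```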

Calibration dataset: 22,000 samples drawn from 12 sources covering coding, reasoning, math, science, tool-calling, and agentic tasks:

| Category | Samples | Source Dataset |
|---|---|---|
| Coding (general) | 1,000 | theblackcat102/evol-codealpaca-v1 |
| Coding (additional) | 1,636 | theblackcat102/evol-codealpaca-v1 |
| Reasoning -- code | 3,480 | open-r1/Mixture-of-Thoughts[code] |
| Reasoning -- math | 3,578 | open-r1/Mixture-of-Thoughts[math] |
| Reasoning -- science | 3,576 | open-r1/Mixture-of-Thoughts[science] |
| Tool calling | 1,000 | Salesforce/xlam-function-calling-60k |
| Agentic coding | 1,000 | SWE-bench/SWE-smith-trajectories |
| Biomedical QA | 800 | qiaojin/PubMedQA[pqa_labeled] |
| Science QA | 800 | derek-thomas/ScienceQA |
| Grade-school math | 4,466 | openai/gsm8k[main] |
| Competition math | 500 | HuggingFaceH4/MATH-500 |
| Code correctness | 164 | evalplus/humanevalplus |
| **Total** | **22,000** | |

Step 2: REAP Pruning

Using the recorded activation data, REAP scores each expert's importance per layer by combining router gate values, expert activation norms, and frequency-weighted saliency. The lowest-scoring 20% of experts (25 per layer) are removed. Router logits are renormalized post-pruning to maintain the output distribution.
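The scoring-and-pruning step can be sketched like this. The exact saliency criterion is defined in the REAP paper; this toy version combines the three recorded statistics (gates, norms, frequency) in one plausible way.

```python
import numpy as np

def reap_saliency(count, gate_sum, norm_sum, total_tokens):
    """One plausible frequency-weighted score: avg gate * avg norm * routing freq."""
    safe = np.maximum(count, 1)  # avoid division by zero for never-selected experts
    return (gate_sum / safe) * (norm_sum / safe) * (count / total_tokens)

def experts_to_keep(saliency, ratio=0.20):
    """Drop the lowest-scoring `ratio` of experts; return kept indices in order."""
    n_drop = int(len(saliency) * ratio)          # 128 * 0.20 = 25 dropped per layer
    return np.sort(np.argsort(saliency)[n_drop:])
```

With 128 experts and `ratio=0.20`, this keeps exactly the 103 experts per layer reported above.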

Pruning Configuration

| Parameter | Value |
|---|---|
| Compression ratio | 0.20 (20% expert removal) |
| Original experts per layer | 128 |
| Remaining experts per layer | 103 |
| Pruning method | REAP |
| Distance measure | Angular (cosine) |
| Router weight renormalization | Yes |
| Seed | 42 |

Benchmark Results

Accuracy (generative, 0-shot, 50 samples/task, thinking enabled, vLLM 0.19, 4x RTX 3090)

Evaluated using lm-eval generative tasks with --apply_chat_template and think_end_token=<channel|> to properly handle Gemma 4's thinking mode. Scores extracted from model responses using regex matching.
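An illustrative lm-eval invocation for this setup (task list and parallelism are examples; the thinking-token handling described above is configured separately):

```shell
lm_eval --model vllm \
  --model_args pretrained=0xSero/gemma-4-21b-a4b-it-REAP,tensor_parallel_size=4 \
  --tasks gsm8k \
  --num_fewshot 0 \
  --limit 50 \
  --apply_chat_template
```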

| Task | Original | REAP 0.20 | REAP 0.30 |
|---|---|---|---|
| Elementary Math | 92% | 90% | 88% |
| Philosophy | 92% | 88% | 74% |
| World Religions | 90% | 64% | 48% |
| College CS | 56% | 76% | 68% |
| HS Math | 24%* | 44%* | 48%* |
| Abstract Algebra | 12%* | 28%* | 28%* |
| College Math | 16%* | 18%* | 24%* |
| GSM8K | 86% | 84% | -- |

* Tasks with significant extraction failures (model outputs equations rather than single letters). Real accuracy likely higher for all models.

Notes:

  • Gemma 4 is a thinking model -- it reasons internally before answering. Standard loglikelihood-based benchmarks give misleading results because the model emits reasoning tokens before its final answer.
  • GSM8K uses flexible-extract which handles thinking output well.
  • College CS and math tasks show REAP sometimes outperforming the original, likely due to sampling variance at n=50.
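A minimal answer-letter extractor in the spirit of the regex matching described above (the exact patterns used in the eval are an assumption; the `None` return corresponds to the "extraction failure" case flagged for the math-heavy tasks):

```python
import re

def extract_choice(response):
    """Return a single answer letter A-D from a free-form response, or None."""
    text = response.strip()
    for pattern in (r"[Aa]nswer\s*(?:is)?[:\s]*\(?([A-D])\)?",  # "Answer: B" / "answer is (B)"
                    r"\(([A-D])\)",                              # "(C)"
                    r"\b([A-D])\s*$"):                           # trailing lone letter
        m = re.search(pattern, text)
        if m:
            return m.group(1)
    return None  # extraction failure: e.g. the model output an equation instead
```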

Generation Quality: Side-by-Side (14 prompts, temp=0.7, top_p=0.9, max 2048 tokens)

Both the original and REAP 0.20 models were tested on 14 challenging prompts across coding, math, philosophy, long-context, and repetition stress with proper chat template formatting.

| Domain | N | Orig avg words | REAP avg words | Orig loop | REAP loop | Orig collapse | REAP collapse |
|---|---|---|---|---|---|---|---|
| Coding | 3 | 670 | 648 | 0% | 0% | 0% | 0% |
| Math reasoning | 3 | 296 | 261 | 0% | 0% | 0% | 0% |
| Philosophy | 3 | 819 | 727 | 0% | 0% | 0% | 0% |
| Long context | 2 | 1210 | 854 | 50% | 0% | 0% | 0% |
| Repetition stress | 3 | 1088 | 1099 | 33% | 33% | 0% | 0% |

12/14 clean ties, 1 REAP win (long-context), 1 mutual mild loop (sorting algorithms). The REAP 0.20 model is essentially indistinguishable from the original on generation quality.

Architecture

Gemma 4 uses a hybrid sliding/full attention MoE architecture:

  • 30 transformer layers
  • Sliding attention (window=1024) for 25 layers, full attention every 6th layer
  • MoE FFN with 103 remaining experts per layer (originally 128), 8 active per token
  • Thinking model -- uses <|channel>thought / <|channel>response channels
  • Multimodal -- supports text and vision inputs
  • Context window: 262,144 tokens
  • Vocab size: 262,144
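The sliding/full layout above can be sketched as a layer schedule (the 1-indexed placement of the full-attention layers is an assumption; only the 25/5 split is stated above):

```python
def attention_pattern(n_layers=30, full_every=6, window=1024):
    """Full attention on every 6th layer, sliding-window attention elsewhere."""
    return ["full" if (i + 1) % full_every == 0 else f"sliding({window})"
            for i in range(n_layers)]

pattern = attention_pattern()
assert pattern.count("full") == 5              # 30 / 6 full-attention layers
assert pattern.count("sliding(1024)") == 25    # the 25 sliding layers above
```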

Usage

Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "0xSero/gemma-4-21b-a4b-it-REAP"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

messages = [{"role": "user", "content": "Write a quicksort in Python."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

vLLM

```shell
pip install "vllm>=0.19" "transformers>=5.0"

vllm serve 0xSero/gemma-4-21b-a4b-it-REAP \
    --tensor-parallel-size 2 \
    --enforce-eager \
    --gpu-memory-utilization 0.9 \
    --max-model-len 16384 \
    --trust-remote-code
```
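Once the server is up, it can be queried through vLLM's OpenAI-compatible endpoint (default port 8000):

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "0xSero/gemma-4-21b-a4b-it-REAP",
        "messages": [{"role": "user", "content": "Write a quicksort in Python."}],
        "max_tokens": 4096
      }'
```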

Citation

```bibtex
@inproceedings{lasby2025reap,
  title={{REAP} the Experts: Why Pruning Prevails for One-Shot {MoE} Compression},
  author={Lasby, Mike and others},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026},
  url={https://arxiv.org/abs/2510.13999}
}
```
