# Gemma 4 19B-A4B-it REAP

A 30% expert-pruned version of google/gemma-4-26b-a4b-it, produced with Cerebras REAP (Router-weighted Expert Activation Pruning).

|                    | Original | 0.20 variant | This model (0.30) |
|--------------------|----------|--------------|-------------------|
| Total params       | ~26B     | 21.34B       | 19.02B            |
| Experts per layer  | 128      | 103          | 90                |
| Active params/tok  | ~4B      | ~4B          | ~4B               |
| Experts/tok        | 8        | 8            | 8                 |
| Format             | BF16     | BF16         | BF16              |
| Disk size          | ~52 GB   | ~43 GB       | ~36 GB            |

REAP removes 30% of MoE experts (38 of 128 per layer) while preserving the model's routing behavior. The active parameter count per token is unchanged since the router still selects 8 experts per token from the remaining pool. This yields a ~31% reduction in total disk/memory footprint.
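The arithmetic behind those counts can be checked directly. This is a back-of-the-envelope sketch using the numbers from this card, not values read from the checkpoint:

```python
# Pruning arithmetic from the card: 30% of 128 experts removed per layer,
# while the router's top-k (active experts per token) is unchanged.
TOTAL_EXPERTS = 128    # experts per MoE layer in the original model
COMPRESSION = 0.30     # fraction of experts removed
ACTIVE_PER_TOKEN = 8   # top-k experts the router selects per token

removed = round(TOTAL_EXPERTS * COMPRESSION)  # 38 experts pruned per layer
remaining = TOTAL_EXPERTS - removed           # 90 experts remain

# Active compute per token is unchanged: the router still picks 8 experts
# from the smaller pool, so only total (not per-token) parameters shrink.
assert ACTIVE_PER_TOKEN <= remaining
print(removed, remaining)  # 38 90
```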

## How This Model Was Made

### Step 1: Calibration (Activation Observation)

We ran the full Gemma 4 26B-A4B-it model over a curated calibration dataset to record expert activation patterns across all layers.

Calibration dataset: 22,000 samples drawn from 12 sources:

| Category           | Samples | Source dataset                          |
|--------------------|---------|-----------------------------------------|
| Coding (general)   | 1,000   | theblackcat102/evol-codealpaca-v1       |
| Coding (additional)| 1,636   | theblackcat102/evol-codealpaca-v1       |
| Reasoning (code)   | 3,480   | open-r1/Mixture-of-Thoughts[code]       |
| Reasoning (math)   | 3,578   | open-r1/Mixture-of-Thoughts[math]       |
| Reasoning (science)| 3,576   | open-r1/Mixture-of-Thoughts[science]    |
| Tool calling       | 1,000   | Salesforce/xlam-function-calling-60k    |
| Agentic coding     | 1,000   | SWE-bench/SWE-smith-trajectories        |
| Biomedical QA      | 800     | qiaojin/PubMedQA[pqa_labeled]           |
| Science QA         | 800     | derek-thomas/ScienceQA                  |
| Grade-school math  | 4,466   | openai/gsm8k[main]                      |
| Competition math   | 500     | HuggingFaceH4/MATH-500                  |
| Code correctness   | 164     | evalplus/humanevalplus                  |
| **Total**          | **22,000** |                                      |
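Conceptually, the observation pass tallies, for each expert, how often the router selects it and with what gate weight. The sketch below illustrates this bookkeeping with synthetic router logits standing in for the real model's forward hooks (the router, shapes, and values here are illustrative assumptions, not the actual calibration code):

```python
import numpy as np

# Toy "activation observation": for each calibration token, a softmax router
# selects the top-8 experts; we tally selection frequency and the mean gate
# weight each expert receives when selected. Real calibration gathers the
# same statistics via forward hooks on the actual model.
rng = np.random.default_rng(42)
NUM_EXPERTS, TOP_K, NUM_TOKENS = 128, 8, 10_000

logits = rng.normal(size=(NUM_TOKENS, NUM_EXPERTS))             # router logits
gates = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)  # softmax gates
topk = np.argsort(gates, axis=-1)[:, -TOP_K:]                   # 8 experts/token

counts = np.zeros(NUM_EXPERTS)
gate_sum = np.zeros(NUM_EXPERTS)
for t in range(NUM_TOKENS):
    for e in topk[t]:
        counts[e] += 1
        gate_sum[e] += gates[t, e]

freq = counts / NUM_TOKENS                    # how often each expert fires
mean_gate = gate_sum / np.maximum(counts, 1)  # avg router weight when selected
```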

### Step 2: REAP Pruning

The lowest-scoring 30% of experts (38 per layer) are removed based on combined router gate values, activation norms, and frequency-weighted saliency. Router logits are renormalized post-pruning.
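The selection step can be sketched as follows. The gate values and output norms below are synthetic stand-ins, and the exact REAP saliency formula is defined in the paper; this only illustrates the router-weighted scoring and per-layer drop of the bottom 30%, not the reference implementation:

```python
import numpy as np

# REAP-style selection sketch: score each expert by its router gate weight
# times its output norm, averaged over calibration tokens, then drop the
# lowest-scoring 30% per layer. Synthetic values stand in for real stats.
rng = np.random.default_rng(42)
NUM_EXPERTS, NUM_TOKENS, RATIO = 128, 5_000, 0.30

gate = rng.uniform(0.05, 0.30, size=(NUM_TOKENS, NUM_EXPERTS))     # router weights
out_norm = rng.uniform(0.5, 2.0, size=(NUM_TOKENS, NUM_EXPERTS))   # ||expert(x)||

saliency = (gate * out_norm).mean(axis=0)          # router-weighted activation
num_pruned = round(NUM_EXPERTS * RATIO)            # 38 experts per layer
keep = np.sort(np.argsort(saliency)[num_pruned:])  # indices of the 90 survivors
# After pruning, only the surviving experts' router logits remain, so the
# softmax over them renormalizes the gate weights automatically.
```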

#### Pruning Configuration

| Parameter                    | Value                    |
|------------------------------|--------------------------|
| Compression ratio            | 0.30 (30% expert removal)|
| Original experts per layer   | 128                      |
| Remaining experts per layer  | 90                       |
| Pruning method               | REAP                     |
| Distance measure             | Angular (cosine)         |
| Router weight renormalization| Yes                      |
| Seed                         | 42                       |

## Benchmark Results

### Accuracy (generative, 0-shot, 50 samples/task, thinking enabled, vLLM 0.19, 4x RTX 3090)

Evaluated using lm-eval generative tasks with `--apply_chat_template` and `think_end_token=<channel|>` to properly handle Gemma 4's thinking mode.

| Task             | Original | REAP 0.20 | REAP 0.30 |
|------------------|----------|-----------|-----------|
| Elementary Math  | 92%      | 90%       | 88%       |
| Philosophy       | 92%      | 88%       | 74%       |
| World Religions  | 90%      | 64%       | 48%       |
| College CS       | 56%      | 76%       | 68%       |
| HS Math          | 24%*     | 44%*      | 48%*      |
| Abstract Algebra | 12%*     | 28%*      | 28%*      |
| College Math     | 16%*     | 18%*      | 24%*      |

\* Tasks with answer-extraction failures; true accuracy is likely higher than reported.

Summary: At 30% pruning, the model retains strong performance on elementary math (88%) and college CS (68%, above the original's 56%). Knowledge-intensive tasks drop more sharply: the 30% variant trades ~18 points on philosophy and ~42 points on world religions for a ~31% smaller model.

### Robustness: Generation Quality (14 prompts, temp=0.7, top_p=0.9, max 2048 tokens)

| Domain            | N | Avg words (orig) | Avg words (REAP 0.30) | Loop (orig) | Loop (REAP) | Collapse (orig) | Collapse (REAP) |
|-------------------|---|------------------|-----------------------|-------------|-------------|-----------------|-----------------|
| Coding            | 3 | 670              | 558                   | 0%          | 0%          | 0%              | 0%              |
| Math reasoning    | 3 | 296              | 305                   | 0%          | 0%          | 0%              | 0%              |
| Philosophy        | 3 | 819              | 681                   | 0%          | 0%          | 0%              | 0%              |
| Long context      | 2 | 1210             | 864                   | 50%         | 0%          | 0%              | 0%              |
| Repetition stress | 3 | 1088             | 1056                  | 33%         | 33%         | 0%              | 0%              |

Zero looping and zero collapse across coding, math, and philosophy. The REAP 0.30 model even beats the original on the long-context prompts (0% vs. 50% loop rate) and matches it on the repetition stress test (33% each). On this suite, generation quality is on par with the original at 30% compression.

## Architecture

Gemma 4 uses a hybrid sliding/full attention MoE architecture:

- 30 transformer layers
- Sliding-window attention (window = 1024) for 25 layers; full attention every 6th layer
- MoE FFN with 90 remaining experts per layer (originally 128), 8 active per token
- Thinking model: uses `<|channel>thought` / `<|channel>response` channels
- Multimodal: supports text and vision inputs
- Context window: 262,144 tokens
- Vocab size: 262,144
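The attention interleaving above can be sketched as a layer pattern. The exact indexing is an assumption (here, "every 6th layer" is read as 1-indexed layers 6, 12, 18, 24, 30 using full attention); the real config may index differently:

```python
# Sketch of the sliding/full attention interleaving: 30 layers, with every
# 6th layer (1-indexed) using full attention and the rest sliding-window.
# The indexing convention here is an assumption, not read from the config.
NUM_LAYERS = 30
pattern = ["full" if (i + 1) % 6 == 0 else "sliding" for i in range(NUM_LAYERS)]

print(pattern.count("full"), pattern.count("sliding"))  # 5 25
```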

## Usage

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "0xSero/gemma-4-19b-a4b-it-REAP"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # loads in BF16 as stored
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a quicksort in Python."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

### vLLM

```bash
# Quote the version specifiers so the shell doesn't treat ">" as a redirect
pip install "vllm>=0.19" "transformers>=5.0"

vllm serve 0xSero/gemma-4-19b-a4b-it-REAP \
    --tensor-parallel-size 2 \
    --enforce-eager \
    --gpu-memory-utilization 0.9 \
    --max-model-len 16384 \
    --trust-remote-code
```

## Citation

```bibtex
@inproceedings{lasby2025reap,
  title={{REAP} the Experts: Why Pruning Prevails for One-Shot {MoE} Compression},
  author={Lasby, Mike and others},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026},
  url={https://arxiv.org/abs/2510.13999}
}
```
