# Gemma 4 19B-A4B-it REAP

A 30% expert-pruned version of google/gemma-4-26b-a4b-it, produced with Cerebras REAP (Router-weighted Expert Activation Pruning).

|                    | Original | 0.20 variant | This model (0.30) |
|--------------------|----------|--------------|-------------------|
| Total params       | ~26B     | 21.34B       | 19.02B            |
| Experts per layer  | 128      | 103          | 90                |
| Active params/tok  | ~4B      | ~4B          | ~4B               |
| Experts/tok        | 8        | 8            | 8                 |
| Format             | BF16     | BF16         | BF16              |
| Disk size          | ~52 GB   | ~43 GB       | ~36 GB            |

REAP removes 30% of MoE experts (38 of 128 per layer) while preserving the model's routing behavior. The active parameter count per token is unchanged since the router still selects 8 experts per token from the remaining pool. This yields a ~31% reduction in total disk/memory footprint.
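The arithmetic behind those counts can be checked directly. This is a back-of-the-envelope sketch using the numbers from this card, not values read from the checkpoint:

```python
# Pruning arithmetic from the card: 30% of 128 experts removed per layer,
# while the router's top-k (active experts per token) is unchanged.
TOTAL_EXPERTS = 128    # experts per MoE layer in the original model
COMPRESSION = 0.30     # fraction of experts removed
ACTIVE_PER_TOKEN = 8   # top-k experts the router selects per token

removed = round(TOTAL_EXPERTS * COMPRESSION)  # 38 experts pruned per layer
remaining = TOTAL_EXPERTS - removed           # 90 experts remain

# Active compute per token is unchanged: the router still picks 8 experts
# from the smaller pool, so only total (not per-token) parameters shrink.
assert ACTIVE_PER_TOKEN <= remaining
print(removed, remaining)  # 38 90
```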

## How This Model Was Made

### Step 1: Calibration (Activation Observation)

We ran the full Gemma 4 26B-A4B-it model over a curated calibration dataset to record expert activation patterns across all layers.

Calibration dataset: 22,000 samples drawn from 12 sources:

| Category           | Samples | Source dataset                          |
|--------------------|---------|-----------------------------------------|
| Coding (general)   | 1,000   | theblackcat102/evol-codealpaca-v1       |
| Coding (additional)| 1,636   | theblackcat102/evol-codealpaca-v1       |
| Reasoning (code)   | 3,480   | open-r1/Mixture-of-Thoughts[code]       |
| Reasoning (math)   | 3,578   | open-r1/Mixture-of-Thoughts[math]       |
| Reasoning (science)| 3,576   | open-r1/Mixture-of-Thoughts[science]    |
| Tool calling       | 1,000   | Salesforce/xlam-function-calling-60k    |
| Agentic coding     | 1,000   | SWE-bench/SWE-smith-trajectories        |
| Biomedical QA      | 800     | qiaojin/PubMedQA[pqa_labeled]           |
| Science QA         | 800     | derek-thomas/ScienceQA                  |
| Grade-school math  | 4,466   | openai/gsm8k[main]                      |
| Competition math   | 500     | HuggingFaceH4/MATH-500                  |
| Code correctness   | 164     | evalplus/humanevalplus                  |
| **Total**          | **22,000** |                                      |
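Conceptually, the observation pass tallies, for each expert, how often the router selects it and with what gate weight. The sketch below illustrates this bookkeeping with synthetic router logits standing in for the real model's forward hooks (the router, shapes, and values here are illustrative assumptions, not the actual calibration code):

```python
import numpy as np

# Toy "activation observation": for each calibration token, a softmax router
# selects the top-8 experts; we tally selection frequency and the mean gate
# weight each expert receives when selected. Real calibration gathers the
# same statistics via forward hooks on the actual model.
rng = np.random.default_rng(42)
NUM_EXPERTS, TOP_K, NUM_TOKENS = 128, 8, 10_000

logits = rng.normal(size=(NUM_TOKENS, NUM_EXPERTS))             # router logits
gates = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)  # softmax gates
topk = np.argsort(gates, axis=-1)[:, -TOP_K:]                   # 8 experts/token

counts = np.zeros(NUM_EXPERTS)
gate_sum = np.zeros(NUM_EXPERTS)
for t in range(NUM_TOKENS):
    for e in topk[t]:
        counts[e] += 1
        gate_sum[e] += gates[t, e]

freq = counts / NUM_TOKENS                    # how often each expert fires
mean_gate = gate_sum / np.maximum(counts, 1)  # avg router weight when selected
```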

### Step 2: REAP Pruning

The lowest-scoring 30% of experts (38 per layer) are removed based on combined router gate values, activation norms, and frequency-weighted saliency. Router logits are renormalized post-pruning.
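The selection step can be sketched as follows. The gate values and output norms below are synthetic stand-ins, and the exact REAP saliency formula is defined in the paper; this only illustrates the router-weighted scoring and per-layer drop of the bottom 30%, not the reference implementation:

```python
import numpy as np

# REAP-style selection sketch: score each expert by its router gate weight
# times its output norm, averaged over calibration tokens, then drop the
# lowest-scoring 30% per layer. Synthetic values stand in for real stats.
rng = np.random.default_rng(42)
NUM_EXPERTS, NUM_TOKENS, RATIO = 128, 5_000, 0.30

gate = rng.uniform(0.05, 0.30, size=(NUM_TOKENS, NUM_EXPERTS))     # router weights
out_norm = rng.uniform(0.5, 2.0, size=(NUM_TOKENS, NUM_EXPERTS))   # ||expert(x)||

saliency = (gate * out_norm).mean(axis=0)          # router-weighted activation
num_pruned = round(NUM_EXPERTS * RATIO)            # 38 experts per layer
keep = np.sort(np.argsort(saliency)[num_pruned:])  # indices of the 90 survivors
# After pruning, only the surviving experts' router logits remain, so the
# softmax over them renormalizes the gate weights automatically.
```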

#### Pruning Configuration

| Parameter                    | Value                    |
|------------------------------|--------------------------|
| Compression ratio            | 0.30 (30% expert removal)|
| Original experts per layer   | 128                      |
| Remaining experts per layer  | 90                       |
| Pruning method               | REAP                     |
| Distance measure             | Angular (cosine)         |
| Router weight renormalization| Yes                      |
| Seed                         | 42                       |

## Benchmark Results

### Accuracy (generative, 0-shot, 50 samples/task, thinking enabled, vLLM 0.19, 4x RTX 3090)

Evaluated using lm-eval generative tasks with `--apply_chat_template` and `think_end_token=<channel|>` to properly handle Gemma 4's thinking mode.

| Task             | Original | REAP 0.20 | REAP 0.30 |
|------------------|----------|-----------|-----------|
| Elementary Math  | 92%      | 90%       | 88%       |
| Philosophy       | 92%      | 88%       | 74%       |
| World Religions  | 90%      | 64%       | 48%       |
| College CS       | 56%      | 76%       | 68%       |
| HS Math          | 24%*     | 44%*      | 48%*      |
| Abstract Algebra | 12%*     | 28%*      | 28%*      |
| College Math     | 16%*     | 18%*      | 24%*      |

\* Tasks with answer-extraction failures; true accuracy is likely higher than reported.

Summary: At 30% pruning, the model retains strong performance on elementary math (88%) and college CS (68%, above the original's 56%). Knowledge-intensive tasks drop more sharply: the 30% variant trades ~18 points on philosophy and ~42 points on world religions for a ~31% smaller model.

### Robustness: Generation Quality (14 prompts, temp=0.7, top_p=0.9, max 2048 tokens)

| Domain            | N | Avg words (orig) | Avg words (REAP 0.30) | Loop (orig) | Loop (REAP) | Collapse (orig) | Collapse (REAP) |
|-------------------|---|------------------|-----------------------|-------------|-------------|-----------------|-----------------|
| Coding            | 3 | 670              | 558                   | 0%          | 0%          | 0%              | 0%              |
| Math reasoning    | 3 | 296              | 305                   | 0%          | 0%          | 0%              | 0%              |
| Philosophy        | 3 | 819              | 681                   | 0%          | 0%          | 0%              | 0%              |
| Long context      | 2 | 1210             | 864                   | 50%         | 0%          | 0%              | 0%              |
| Repetition stress | 3 | 1088             | 1056                  | 33%         | 33%         | 0%              | 0%              |

Zero looping and zero collapse across coding, math, and philosophy. The REAP 0.30 model even beats the original on the long-context prompts (0% vs. 50% loop rate) and matches it on the repetition stress test (33% each). On this suite, generation quality is on par with the original at 30% compression.

## Architecture

Gemma 4 uses a hybrid sliding/full attention MoE architecture:

- 30 transformer layers
- Sliding-window attention (window = 1024) for 25 layers; full attention every 6th layer
- MoE FFN with 90 remaining experts per layer (originally 128), 8 active per token
- Thinking model: uses `<|channel>thought` / `<|channel>response` channels
- Multimodal: supports text and vision inputs
- Context window: 262,144 tokens
- Vocab size: 262,144
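The attention interleaving above can be sketched as a layer pattern. The exact indexing is an assumption (here, "every 6th layer" is read as 1-indexed layers 6, 12, 18, 24, 30 using full attention); the real config may index differently:

```python
# Sketch of the sliding/full attention interleaving: 30 layers, with every
# 6th layer (1-indexed) using full attention and the rest sliding-window.
# The indexing convention here is an assumption, not read from the config.
NUM_LAYERS = 30
pattern = ["full" if (i + 1) % 6 == 0 else "sliding" for i in range(NUM_LAYERS)]

print(pattern.count("full"), pattern.count("sliding"))  # 5 25
```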

## Usage

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "0xSero/gemma-4-19b-a4b-it-REAP"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # loads in BF16 as stored
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a quicksort in Python."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

### vLLM

```bash
# Quote the version specifiers so the shell doesn't treat ">" as a redirect
pip install "vllm>=0.19" "transformers>=5.0"

vllm serve 0xSero/gemma-4-19b-a4b-it-REAP \
    --tensor-parallel-size 2 \
    --enforce-eager \
    --gpu-memory-utilization 0.9 \
    --max-model-len 16384 \
    --trust-remote-code
```

## Citation

```bibtex
@inproceedings{lasby2025reap,
  title={{REAP} the Experts: Why Pruning Prevails for One-Shot {MoE} Compression},
  author={Lasby, Mike and others},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026},
  url={https://arxiv.org/abs/2510.13999}
}
```
