---
base_model:
- cerebras/MiniMax-M2.5-REAP-139B-A10B
license: mit
license_name: modified-mit
library_name: transformers
pipeline_tag: text-generation
language:
- en
- zh
tags:
- minimax
- nvfp4
- 4-bit
- quantized
- compressed-tensors
- vllm
- DGX-Spark
- GB10
- MoE
- REAP
---

# MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10

NVFP4 quantization of [cerebras/MiniMax-M2.5-REAP-139B-A10B](https://huggingface.co/cerebras/MiniMax-M2.5-REAP-139B-A10B) for the NVIDIA DGX Spark (GB10).

The base model is a [Cerebras REAP](https://www.cerebras.ai/blog/reap) (Router-weighted Expert Activation Pruning) variant of [MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5). REAP uniformly prunes experts from 256 → 154 (40% pruning), reducing total parameters from 230B to 139B while maintaining near-identical performance. This is the more aggressively pruned sibling of the 172B (25% pruning) variant.

## Model Details

| | |
|---|---|
| **Base Model** | [cerebras/MiniMax-M2.5-REAP-139B-A10B](https://huggingface.co/cerebras/MiniMax-M2.5-REAP-139B-A10B) |
| **Original Model** | [MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) (230B) |
| **Architecture** | MiniMaxM2ForCausalLM (MoE, 154 experts, 8 active per token) |
| **Total Parameters** | 139B |
| **Active Parameters** | 10B per token |
| **Hidden Layers** | 62 |
| **Quantization** | NVFP4 (4-bit floating point), all layers including self_attn |
| **Format** | compressed-tensors (safetensors), 17 shards |
| **Size on Disk** | 75 GB |
| **Context Length** | 196,608 tokens (~192K) |
| **License** | Modified MIT (inherited from [Cerebras REAP](https://huggingface.co/cerebras/MiniMax-M2.5-REAP-139B-A10B)) |

## Why 139B over 172B?
| | 172B REAP | 139B REAP |
|---|---|---|
| **Expert pruning** | 25% (256 → 192) | 40% (256 → 154) |
| **NVFP4 size** | 93 GB | **75 GB** |
| **Single Spark fit** | Tight (max ~65K ctx) | **Comfortable (~90K+ ctx headroom)** |
| **Cerebras eval loss** | Baseline | ~0.5% degradation |

The 139B variant trades minimal quality for significantly more memory headroom on a single DGX Spark. With 75 GB of model weights versus 93 GB, you gain ~18 GB for KV cache — translating to substantially more context or more concurrent sessions.

## Performance (Single NVIDIA DGX Spark — GB10, 128 GB)

> **TODO:** Benchmark pending — model just quantized. Will update with llama-benchy results.
>
> Expected: similar to or slightly faster than the 172B NVFP4 (27–29 tok/s) due to the smaller model footprint.

## Quantization Details

- **Method:** Post-training quantization via [LLM Compressor](https://github.com/vllm-project/llm-compressor) (`llmcompressor` 0.10.0)
- **Scheme:** NVFP4 (compressed-tensors format)
- **Calibration Dataset:** [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) (train_sft split)
- **Calibration Samples:** 64
- **Max Sequence Length:** 2048 tokens
- **Ignore List:** `lm_head`, `model.embed_tokens`, `re:.*block_sparse_moe\.gate$`
- **Environment:** `LLMCOMPRESSOR_MOE_CALIBRATE_ALL_EXPERTS=1`
- **Hardware Used:** NVIDIA DGX Spark (CPU offloading + 300 GB swap)
- **Total Quantization Time:** 4.7 hours (281 minutes)
  - Quant pipeline model load: 50 seconds (27 BF16 shards into CPU RAM — this is the llmcompressor load, not vLLM inference)
  - Calibration forward passes + weight calibration (28,892 weights): ~2+ hours (swap-dominated)
  - Model compression: 28,892 iterations in ~60 minutes (highly variable 1–16 it/s due to swap I/O)
  - Model save: 17 shards to disk
  - Bottleneck: swap I/O throughout (260 GB model on 128 GB RAM + 300 GB swap)

### Quantization Pipeline

The source model on Hugging Face is labeled BF16 but actually contains `float8_e4m3fn` weights with
`weight_scale_inv` blocks of shape [128, 128]. A dequantization step was therefore required before NVFP4 quantization:

1. **Download:** `cerebras/MiniMax-M2.5-REAP-139B-A10B` (131 GB, 27 shards — FP8)
2. **Dequant FP8 → BF16:** Block-wise dequantization (multiply each [128, 128] block by its `scale_inv` entry), output 260 GB / 27 shards
3. **Quantize BF16 → NVFP4:** LLM Compressor oneshot with the GB10-optimized ignore list
4. **Output:** 75 GB / 17 shards (compressed-tensors format)

### Key Advantage Over Conservative Quants

This quantization covers **all Linear layers**, including the `self_attn` q/k/v projections. Conservative approaches (e.g., lukealonso's NVFP4) leave attention in BF16, wasting ~47% of per-token bandwidth. Quantizing attention is safe here because NVFP4 calibration handles the attention weight distributions well on this architecture.

### Container Setup for Quantization

```bash
# Image: avarok/dgx-vllm-nvfp4-kernel:v23 (has llmcompressor + deps)
# Override entrypoint since default launches vLLM server
docker run -d --name minimax-139b-quant \
  --gpus all --ipc=host \
  -v /opt/huggingface/models/MiniMax-M2.5-REAP-139B-A10B-BF16-real:/workspace/input_model \
  -v /opt/huggingface/models/MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10:/workspace/output_model \
  -v /opt/huggingface/models/quantize-minimax-139b.py:/workspace/quantize.py \
  -e LLMCOMPRESSOR_MOE_CALIBRATE_ALL_EXPERTS=1 \
  --entrypoint bash \
  avarok/dgx-vllm-nvfp4-kernel:v23 \
  -c "pip install --upgrade transformers && python /workspace/quantize.py"
```

**Important:** The `--entrypoint bash` override is required because the default entrypoint launches vLLM. The `pip install --upgrade transformers` is needed because the image ships an older transformers release that doesn't support the MiniMax M2 architecture.

### Swap Configuration

The 260 GB BF16 model exceeds the 128 GB of physical RAM.
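A quick back-of-envelope check makes the swap sizing concrete. The model and RAM figures come from this card; the overhead figure for the OS and calibration activations is a rough assumption, not a measured value:

```python
# Rough sizing for the quantization run on a 128 GB DGX Spark.
# model_gb and phys_ram_gb are from this card; overhead_gb is an
# ASSUMED round number for OS + calibration activations.
model_gb = 260      # dequantized BF16 shards held in CPU RAM
phys_ram_gb = 128   # GB10 unified memory
overhead_gb = 20    # rough assumption

spill_gb = model_gb + overhead_gb - phys_ram_gb
print(f"~{spill_gb} GB must live in swap")  # ~152 GB
assert spill_gb <= 300, "a 300 GB swap file would not be enough"
```

Roughly 150 GB of the working set has to page through swap, so a 300 GB swap file leaves comfortable margin — which is also why compression throughput is dominated by swap I/O.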
A 300 GB swap file was created:

```bash
sudo fallocate -l 300G /opt/huggingface/swapfile
sudo chmod 600 /opt/huggingface/swapfile
sudo mkswap /opt/huggingface/swapfile
sudo swapon /opt/huggingface/swapfile
```

This causes significant I/O stalls during compression (speed drops from 16 it/s to 1 it/s when paging), but the process completes successfully.

## Running on a Single DGX Spark

**Docker image:** [`avarok/dgx-vllm-nvfp4-kernel:v23`](https://hub.docker.com/r/avarok/dgx-vllm-nvfp4-kernel) (vLLM 0.16.0-rc2, CUDA 13.0, SM 12.1)

**Download the model:**

```bash
huggingface-cli download saricles/MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10 \
  --local-dir /opt/huggingface/models/MiniMax-M2.5-REAP-139B-NVFP4
```

**Launch:**

```bash
docker run -d --name minimax-139b --gpus all --ipc=host \
  -v /opt/huggingface/models/MiniMax-M2.5-REAP-139B-NVFP4:/models/MiniMax-M2.5-REAP-139B-NVFP4 \
  -p 8000:8000 \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  -e MODEL=/models/MiniMax-M2.5-REAP-139B-NVFP4 \
  -e PORT=8000 \
  -e MAX_MODEL_LEN=131072 \
  -e GPU_MEMORY_UTIL=0.93 \
  -e "VLLM_EXTRA_ARGS=--trust-remote-code --kv-cache-dtype fp8 --attention-backend flashinfer --enable-auto-tool-choice --tool-call-parser minimax_m2 --reasoning-parser minimax_m2_append_think" \
  avarok/dgx-vllm-nvfp4-kernel:v23
```

> **Note:** With 75 GB of model weights (vs 93 GB for the 172B), you can likely push `MAX_MODEL_LEN` higher — 131072 should be achievable. Benchmark results will confirm exact limits.
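To gauge how far the context can go, a back-of-envelope FP8 KV-cache sizing helps. The layer count comes from this card, but the KV head count and head dimension below are ASSUMED placeholders, not confirmed values for this architecture — read the real numbers from the checkpoint's `config.json` before trusting the result:

```python
# FP8 KV-cache budget on a 128 GB GB10 at GPU_MEMORY_UTIL=0.93.
# layers is from this card; kv_heads and head_dim are ASSUMED
# placeholders -- substitute the values from config.json.
layers = 62
kv_heads = 8        # assumption
head_dim = 128      # assumption
fp8_bytes = 1       # --kv-cache-dtype fp8

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * fp8_bytes  # K and V
budget_bytes = (128 * 0.93 - 75) * 1024**3  # memory cap minus 75 GB weights
print(f"{kv_bytes_per_token} B/token of KV cache, "
      f"~{budget_bytes / kv_bytes_per_token / 1e6:.2f}M tokens of budget")
```

Under these assumed dimensions the ~44 GB left after weights holds well over 131,072 tokens of KV cache even before accounting for activation and fragmentation overhead, which is consistent with the headroom claimed above.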
**Test it:**

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMax-M2.5-REAP-139B-NVFP4",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,
    "min_p": 0.01,
    "max_tokens": 512
  }'
```

### Environment Variables

| Variable | Why |
|----------|-----|
| `VLLM_NVFP4_GEMM_BACKEND=marlin` | Use Marlin kernels for FP4 GEMM (FlashInfer JIT fails on Spark SM121a) |
| `VLLM_TEST_FORCE_FP8_MARLIN=1` | Required for Marlin backend activation |
| `VLLM_USE_FLASHINFER_MOE_FP4=0` | Disable FlashInfer for MoE FP4 (JIT ninja build crashes) |
| `VLLM_MARLIN_USE_ATOMIC_ADD=1` | Atomic adds for Marlin (stability on GB10) |
| `GPU_MEMORY_UTIL=0.93` | 0.95 OOMs on Spark; 0.93 is the safe max |
| `--kv-cache-dtype fp8` | FP8 KV cache saves memory, enables larger context |
| `--attention-backend flashinfer` | FlashInfer for attention (not MoE) — works fine |

### Recommended Sampling Parameters

Per the [MiniMax documentation](https://huggingface.co/MiniMaxAI/MiniMax-M2.5):

```json
{
  "temperature": 1.0,
  "top_p": 0.95,
  "top_k": 40,
  "min_p": 0.01
}
```

## Comparison: Our Quants vs Others

| Model | Quant | Size | Attention | tok/s (single Spark) |
|-------|-------|------|-----------|---------------------|
| **Ours — 139B REAP NVFP4** | All Linear incl. attn | **75 GB** | Quantized | **TBD** |
| **Ours — 172B REAP NVFP4** | All Linear incl. attn | 93 GB | Quantized | 28 tok/s |
| lukealonso — 139B NVFP4 | Expert MLPs only | 79 GB | BF16 (bottleneck) | ~16 tok/s |

## Related Models

- [saricles/MiniMax-M2.5-REAP-172B-A10B-NVFP4-GB10](https://huggingface.co/saricles/MiniMax-M2.5-REAP-172B-A10B-NVFP4-GB10) — Our 172B REAP NVFP4 (93 GB, 28 tok/s)
- [saricles/Qwen3-Next-80B-A3B-Coder-NVFP4-GB10](https://huggingface.co/saricles/Qwen3-Next-80B-A3B-Coder-NVFP4-GB10) — Qwen3 Coder NVFP4 (62 tok/s)
- [cerebras/MiniMax-M2.5-REAP-139B-A10B](https://huggingface.co/cerebras/MiniMax-M2.5-REAP-139B-A10B) — Source FP8 model
- [cerebras/MiniMax-M2.5-REAP-172B-A10B](https://huggingface.co/cerebras/MiniMax-M2.5-REAP-172B-A10B) — 172B FP8 variant

## Acknowledgments

- Base model by [MiniMax](https://huggingface.co/MiniMaxAI)
- REAP sparse-inference pruning by [Cerebras](https://huggingface.co/cerebras) ([paper](https://arxiv.org/abs/2510.13999))
- Quantization tooling by [vLLM / LLM Compressor](https://github.com/vllm-project/llm-compressor)
- Quantized by [saricles](https://huggingface.co/saricles) on NVIDIA DGX Spark