---
base_model:
- cerebras/MiniMax-M2.5-REAP-139B-A10B
license: mit
license_name: modified-mit
library_name: transformers
pipeline_tag: text-generation
language:
- en
- zh
tags:
- minimax
- nvfp4
- 4-bit
- quantized
- compressed-tensors
- vllm
- DGX-Spark
- GB10
- MoE
- REAP
---

# MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10

NVFP4 quantization of [cerebras/MiniMax-M2.5-REAP-139B-A10B](https://huggingface.co/cerebras/MiniMax-M2.5-REAP-139B-A10B) for the NVIDIA DGX Spark (GB10).

The base model is a [Cerebras REAP](https://www.cerebras.ai/blog/reap) (Router-weighted Expert Activation Pruning) variant of [MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5). REAP uniformly prunes experts from 256 → 154 (40% pruning), reducing total parameters from 230B to 139B while maintaining near-identical performance. This is the more aggressively pruned sibling of the 172B (25% pruning) variant.

## Model Details

| | |
|---|---|
| **Base Model** | [cerebras/MiniMax-M2.5-REAP-139B-A10B](https://huggingface.co/cerebras/MiniMax-M2.5-REAP-139B-A10B) |
| **Original Model** | [MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) (230B) |
| **Architecture** | MiniMaxM2ForCausalLM (MoE, 154 experts, 8 active per token) |
| **Total Parameters** | 139B |
| **Active Parameters** | 10B per token |
| **Hidden Layers** | 62 |
| **Quantization** | NVFP4 (4-bit floating point), all layers including self_attn |
| **Format** | compressed-tensors (safetensors), 17 shards |
| **Size on Disk** | 75 GB |
| **Context Length** | 196,608 tokens (~192K) |
| **License** | Modified MIT (inherited from [Cerebras REAP](https://huggingface.co/cerebras/MiniMax-M2.5-REAP-139B-A10B)) |

## Why 139B over 172B?
| | 172B REAP | 139B REAP |
|---|---|---|
| **Expert pruning** | 25% (256 → 192) | 40% (256 → 154) |
| **NVFP4 size** | 93 GB | **75 GB** |
| **Single Spark fit** | Tight (max ~65K ctx) | **Comfortable (~90K+ ctx headroom)** |
| **Cerebras eval loss** | Baseline | ~0.5% degradation |

The 139B variant trades minimal quality for significantly more memory headroom on a single DGX Spark. With 75 GB of model weights versus 93 GB, you gain ~18 GB for KV cache — translating to substantially more context or more concurrent sessions.

## Performance (Single NVIDIA DGX Spark — GB10, 128 GB)

> **TODO:** Benchmark pending — model just quantized. Will update with llama-benchy results.
>
> Expected: similar to or slightly faster than the 172B NVFP4 (27–29 tok/s) due to the smaller model footprint.

## Quantization Details

- **Method:** Post-training quantization via [LLM Compressor](https://github.com/vllm-project/llm-compressor) (`llmcompressor` 0.10.0)
- **Scheme:** NVFP4 (compressed-tensors format)
- **Calibration Dataset:** [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) (train_sft split)
- **Calibration Samples:** 64
- **Max Sequence Length:** 2048 tokens
- **Ignore List:** `lm_head`, `model.embed_tokens`, `re:.*block_sparse_moe\.gate$`
- **Environment:** `LLMCOMPRESSOR_MOE_CALIBRATE_ALL_EXPERTS=1`
- **Hardware Used:** NVIDIA DGX Spark (CPU offloading + 300 GB swap)
- **Total Quantization Time:** 4.7 hours (281 minutes)
  - Quant pipeline model load: 50 seconds (27 BF16 shards into CPU RAM — this is the llmcompressor load, not vLLM inference)
  - Calibration forward passes + weight calibration (28,892 weights): ~2+ hours (swap-dominated)
  - Model compression: 28,892 iterations in ~60 minutes (highly variable 1–16 it/s due to swap I/O)
  - Model save: 17 shards to disk
  - Bottleneck: swap I/O throughout (260 GB model on 128 GB RAM + 300 GB swap)

### Quantization Pipeline

The source model on Hugging Face is labeled BF16 but actually contains `float8_e4m3fn` weights with
`weight_scale_inv` blocks of shape [128, 128]. A dequantization step was therefore required before NVFP4 quantization:

1. **Download:** `cerebras/MiniMax-M2.5-REAP-139B-A10B` (131 GB, 27 shards — FP8)
2. **Dequant FP8 → BF16:** Block-wise dequantization (multiply each [128, 128] block by its `scale_inv` entry), output 260 GB / 27 shards
3. **Quantize BF16 → NVFP4:** LLM Compressor oneshot with the GB10-optimized ignore list
4. **Output:** 75 GB / 17 shards (compressed-tensors format)

### Key Advantage Over Conservative Quants

This quantization covers **all Linear layers**, including the `self_attn` q/k/v projections. Conservative approaches (e.g., lukealonso's NVFP4) leave attention in BF16, wasting ~47% of per-token bandwidth. Quantizing attention is safe here because NVFP4 calibration handles the attention weight distributions well on this architecture.

### Container Setup for Quantization

```bash
# Image: avarok/dgx-vllm-nvfp4-kernel:v23 (has llmcompressor + deps)
# Override entrypoint since default launches vLLM server
docker run -d --name minimax-139b-quant \
  --gpus all --ipc=host \
  -v /opt/huggingface/models/MiniMax-M2.5-REAP-139B-A10B-BF16-real:/workspace/input_model \
  -v /opt/huggingface/models/MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10:/workspace/output_model \
  -v /opt/huggingface/models/quantize-minimax-139b.py:/workspace/quantize.py \
  -e LLMCOMPRESSOR_MOE_CALIBRATE_ALL_EXPERTS=1 \
  --entrypoint bash \
  avarok/dgx-vllm-nvfp4-kernel:v23 \
  -c "pip install --upgrade transformers && python /workspace/quantize.py"
```

**Important:** The `--entrypoint bash` override is required because the default entrypoint launches vLLM. The `pip install --upgrade transformers` is needed because the image ships an older transformers release that doesn't support the MiniMax M2 architecture.

### Swap Configuration

The 260 GB BF16 model exceeds the 128 GB of physical RAM.
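A quick back-of-envelope check makes the swap sizing concrete. The model and RAM figures come from this card; the overhead figure for the OS and calibration activations is a rough assumption, not a measured value:

```python
# Rough sizing for the quantization run on a 128 GB DGX Spark.
# model_gb and phys_ram_gb are from this card; overhead_gb is an
# ASSUMED round number for OS + calibration activations.
model_gb = 260      # dequantized BF16 shards held in CPU RAM
phys_ram_gb = 128   # GB10 unified memory
overhead_gb = 20    # rough assumption

spill_gb = model_gb + overhead_gb - phys_ram_gb
print(f"~{spill_gb} GB must live in swap")  # ~152 GB
assert spill_gb <= 300, "a 300 GB swap file would not be enough"
```

Roughly 150 GB of the working set has to page through swap, so a 300 GB swap file leaves comfortable margin — which is also why compression throughput is dominated by swap I/O.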
A 300 GB swap file was created:

```bash
sudo fallocate -l 300G /opt/huggingface/swapfile
sudo chmod 600 /opt/huggingface/swapfile
sudo mkswap /opt/huggingface/swapfile
sudo swapon /opt/huggingface/swapfile
```

This causes significant I/O stalls during compression (speed drops from 16 it/s to 1 it/s when paging), but the process completes successfully.

## Running on a Single DGX Spark

**Docker image:** [`avarok/dgx-vllm-nvfp4-kernel:v23`](https://hub.docker.com/r/avarok/dgx-vllm-nvfp4-kernel) (vLLM 0.16.0-rc2, CUDA 13.0, SM 12.1)

**Download the model:**

```bash
huggingface-cli download saricles/MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10 \
  --local-dir /opt/huggingface/models/MiniMax-M2.5-REAP-139B-NVFP4
```

**Launch:**

```bash
docker run -d --name minimax-139b --gpus all --ipc=host \
  -v /opt/huggingface/models/MiniMax-M2.5-REAP-139B-NVFP4:/models/MiniMax-M2.5-REAP-139B-NVFP4 \
  -p 8000:8000 \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  -e MODEL=/models/MiniMax-M2.5-REAP-139B-NVFP4 \
  -e PORT=8000 \
  -e MAX_MODEL_LEN=131072 \
  -e GPU_MEMORY_UTIL=0.93 \
  -e "VLLM_EXTRA_ARGS=--trust-remote-code --kv-cache-dtype fp8 --attention-backend flashinfer --enable-auto-tool-choice --tool-call-parser minimax_m2 --reasoning-parser minimax_m2_append_think" \
  avarok/dgx-vllm-nvfp4-kernel:v23
```

> **Note:** With 75 GB of model weights (vs 93 GB for the 172B), you can likely push `MAX_MODEL_LEN` higher — 131072 should be achievable. Benchmark results will confirm exact limits.
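To gauge how far the context can go, a back-of-envelope FP8 KV-cache sizing helps. The layer count comes from this card, but the KV head count and head dimension below are ASSUMED placeholders, not confirmed values for this architecture — read the real numbers from the checkpoint's `config.json` before trusting the result:

```python
# FP8 KV-cache budget on a 128 GB GB10 at GPU_MEMORY_UTIL=0.93.
# layers is from this card; kv_heads and head_dim are ASSUMED
# placeholders -- substitute the values from config.json.
layers = 62
kv_heads = 8        # assumption
head_dim = 128      # assumption
fp8_bytes = 1       # --kv-cache-dtype fp8

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * fp8_bytes  # K and V
budget_bytes = (128 * 0.93 - 75) * 1024**3  # memory cap minus 75 GB weights
print(f"{kv_bytes_per_token} B/token of KV cache, "
      f"~{budget_bytes / kv_bytes_per_token / 1e6:.2f}M tokens of budget")
```

Under these assumed dimensions the ~44 GB left after weights holds well over 131,072 tokens of KV cache even before accounting for activation and fragmentation overhead, which is consistent with the headroom claimed above.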
**Test it:**

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMax-M2.5-REAP-139B-NVFP4",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,
    "min_p": 0.01,
    "max_tokens": 512
  }'
```

### Environment Variables

| Variable | Why |
|----------|-----|
| `VLLM_NVFP4_GEMM_BACKEND=marlin` | Use Marlin kernels for FP4 GEMM (FlashInfer JIT fails on Spark SM121a) |
| `VLLM_TEST_FORCE_FP8_MARLIN=1` | Required for Marlin backend activation |
| `VLLM_USE_FLASHINFER_MOE_FP4=0` | Disable FlashInfer for MoE FP4 (JIT ninja build crashes) |
| `VLLM_MARLIN_USE_ATOMIC_ADD=1` | Atomic adds for Marlin (stability on GB10) |
| `GPU_MEMORY_UTIL=0.93` | 0.95 OOMs on Spark; 0.93 is the safe max |
| `--kv-cache-dtype fp8` | FP8 KV cache saves memory, enables larger context |
| `--attention-backend flashinfer` | FlashInfer for attention (not MoE) — works fine |

### Recommended Sampling Parameters

Per the [MiniMax documentation](https://huggingface.co/MiniMaxAI/MiniMax-M2.5):

```json
{
  "temperature": 1.0,
  "top_p": 0.95,
  "top_k": 40,
  "min_p": 0.01
}
```

## Comparison: Our Quants vs Others

| Model | Quant | Size | Attention | tok/s (single Spark) |
|-------|-------|------|-----------|---------------------|
| **Ours — 139B REAP NVFP4** | All Linear incl. attn | **75 GB** | Quantized | **TBD** |
| **Ours — 172B REAP NVFP4** | All Linear incl. attn | 93 GB | Quantized | 28 tok/s |
| lukealonso — 139B NVFP4 | Expert MLPs only | 79 GB | BF16 (bottleneck) | ~16 tok/s |

## Related Models

- [saricles/MiniMax-M2.5-REAP-172B-A10B-NVFP4-GB10](https://huggingface.co/saricles/MiniMax-M2.5-REAP-172B-A10B-NVFP4-GB10) — Our 172B REAP NVFP4 (93 GB, 28 tok/s)
- [saricles/Qwen3-Next-80B-A3B-Coder-NVFP4-GB10](https://huggingface.co/saricles/Qwen3-Next-80B-A3B-Coder-NVFP4-GB10) — Qwen3 Coder NVFP4 (62 tok/s)
- [cerebras/MiniMax-M2.5-REAP-139B-A10B](https://huggingface.co/cerebras/MiniMax-M2.5-REAP-139B-A10B) — Source FP8 model
- [cerebras/MiniMax-M2.5-REAP-172B-A10B](https://huggingface.co/cerebras/MiniMax-M2.5-REAP-172B-A10B) — 172B FP8 variant

## Acknowledgments

- Base model by [MiniMax](https://huggingface.co/MiniMaxAI)
- REAP sparse-inference pruning by [Cerebras](https://huggingface.co/cerebras) ([paper](https://arxiv.org/abs/2510.13999))
- Quantization tooling by [vLLM / LLM Compressor](https://github.com/vllm-project/llm-compressor)
- Quantized by [saricles](https://huggingface.co/saricles) on NVIDIA DGX Spark