---
base_model:
- cerebras/MiniMax-M2.5-REAP-172B-A10B
license: mit
license_name: modified-mit
library_name: transformers
pipeline_tag: text-generation
language:
- en
- zh
tags:
- minimax
- nvfp4
- 4-bit
- quantized
- compressed-tensors
- vllm
- DGX-Spark
- GB10
- MoE
- REAP
---

# MiniMax-M2.5-REAP-172B-A10B-NVFP4-GB10

NVFP4 quantization of [cerebras/MiniMax-M2.5-REAP-172B-A10B](https://huggingface.co/cerebras/MiniMax-M2.5-REAP-172B-A10B) for the NVIDIA DGX Spark (GB10).

The base model is a [Cerebras REAP](https://www.cerebras.ai/blog/reap) (Router-weighted Expert Activation Pruning) variant of [MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5). REAP uniformly prunes experts from 256 → 192, reducing total parameters from 230B to 172B while maintaining near-identical performance.

## Model Details

| | |
|---|---|
| **Base Model** | [cerebras/MiniMax-M2.5-REAP-172B-A10B](https://huggingface.co/cerebras/MiniMax-M2.5-REAP-172B-A10B) |
| **Original Model** | [MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) (230B) |
| **Architecture** | MiniMaxM2ForCausalLM (MoE, 192 experts, 8 active per token) |
| **Total Parameters** | 172B |
| **Active Parameters** | 10B per token |
| **Quantization** | NVFP4 (4-bit floating point), all layers including self_attn |
| **Format** | compressed-tensors (safetensors), 20 shards |
| **Size on Disk** | 99 GB |
| **Context Length** | 196,608 tokens (~192K) |
| **License** | Modified MIT (inherited from [Cerebras REAP](https://huggingface.co/cerebras/MiniMax-M2.5-REAP-172B-A10B)) |

## Performance (Single NVIDIA DGX Spark — GB10, 128 GB)

Benchmarked with [llama-benchy](https://github.com/neuralmagic/llama-benchy).
| Metric | Value |
|--------|-------|
| **Decode throughput** | 27–29 tok/s |
| **Prefill (512 tokens)** | 920 tok/s |
| **Prefill (4096 tokens)** | 1,916 tok/s |
| **TTFT (512 tokens)** | 490 ms |
| **Max context (gpu_mem_util=0.93)** | 65,536 tokens |
| **KV cache capacity** | ~127K tokens |

Effective throughput with a large system prompt (~23K tokens): ~21 tok/s.

## Quantization Details

- **Method:** Post-training quantization via [LLM Compressor](https://github.com/vllm-project/llm-compressor)
- **Calibration Dataset:** [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) (train_sft split)
- **Calibration Samples:** 64
- **Max Sequence Length:** 2048 tokens
- **Ignore List:** `lm_head`, `model.embed_tokens`, `re:.*block_sparse_moe\.gate$`
- **Environment:** `LLMCOMPRESSOR_MOE_CALIBRATE_ALL_EXPERTS=1`
- **Hardware Used:** NVIDIA DGX Spark with CPU offloading + swap (~4.7 hours)

## Running on a Single DGX Spark

This is exactly how we run and benchmark this model. One DGX Spark, nothing else.
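Before the run steps, an aside on reproduction: the Quantization Details above map roughly onto LLM Compressor's `oneshot` workflow. The sketch below is not the exact script used to produce this checkpoint; the argument names follow llm-compressor's published example scripts and are assumptions, and the expensive step is guarded under `__main__` so nothing heavy runs on import.

```python
# Quantization settings from the card, expressed as plain data so the sketch
# can be inspected without llm-compressor installed.
IGNORE = ["lm_head", "model.embed_tokens", r"re:.*block_sparse_moe\.gate$"]
NUM_CALIBRATION_SAMPLES = 64
MAX_SEQ_LENGTH = 2048


def build_oneshot_kwargs(model_id: str) -> dict:
    """Collect the oneshot() arguments implied by the card's settings."""
    return {
        "model": model_id,
        "dataset": "HuggingFaceH4/ultrachat_200k",
        "splits": {"calibration": f"train_sft[:{NUM_CALIBRATION_SAMPLES}]"},
        "num_calibration_samples": NUM_CALIBRATION_SAMPLES,
        "max_seq_length": MAX_SEQ_LENGTH,
    }


if __name__ == "__main__":
    # Heavy step: requires llm-compressor, the full 172B checkpoint, and
    # (per the card) LLMCOMPRESSOR_MOE_CALIBRATE_ALL_EXPERTS=1.
    import os

    os.environ["LLMCOMPRESSOR_MOE_CALIBRATE_ALL_EXPERTS"] = "1"
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=IGNORE)
    oneshot(recipe=recipe,
            **build_oneshot_kwargs("cerebras/MiniMax-M2.5-REAP-172B-A10B"))
```

On a 128 GB Spark this only fits with CPU offloading and swap, which is why the quantization run above took ~4.7 hours.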
**Docker image:** [`avarok/dgx-vllm-nvfp4-kernel:v23`](https://hub.docker.com/r/avarok/dgx-vllm-nvfp4-kernel) (vLLM 0.16.0-rc2, CUDA 13.0, SM 12.1)

**Download the model:**

```bash
huggingface-cli download saricles/MiniMax-M2.5-REAP-172B-A10B-NVFP4-GB10 \
  --local-dir /opt/huggingface/models/MiniMax-M2.5-REAP-172B-NVFP4
```

**Launch:**

```bash
docker run -d --name minimax --gpus all --ipc=host \
  -v /opt/huggingface/models/MiniMax-M2.5-REAP-172B-NVFP4:/models/MiniMax-M2.5-REAP-172B-NVFP4 \
  -p 8000:8000 \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  -e MODEL=/models/MiniMax-M2.5-REAP-172B-NVFP4 \
  -e PORT=8000 \
  -e MAX_MODEL_LEN=65536 \
  -e GPU_MEMORY_UTIL=0.93 \
  -e "VLLM_EXTRA_ARGS=--trust-remote-code --kv-cache-dtype fp8 --attention-backend flashinfer --enable-auto-tool-choice --tool-call-parser minimax_m2 --reasoning-parser minimax_m2_append_think" \
  avarok/dgx-vllm-nvfp4-kernel:v23
```

The model takes ~3–4 minutes to load.
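Since the load takes a few minutes, a small polling helper saves manual re-checking. A minimal sketch using only the standard library; the model id defaults to the name used in the test request below, but verify against what `/v1/models` actually reports on your setup:

```python
import json
import time
import urllib.error
import urllib.request


def parse_ready(body: str, model_name: str) -> bool:
    """Return True if a /v1/models response body lists the expected model."""
    data = json.loads(body)
    return any(m.get("id") == model_name for m in data.get("data", []))


def wait_for_server(url: str = "http://localhost:8000/v1/models",
                    model_name: str = "MiniMax-M2.5-REAP-172B-NVFP4",
                    timeout_s: int = 600) -> bool:
    """Poll the vLLM server until the model is listed or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if parse_ready(resp.read().decode(), model_name):
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server still loading; keep polling
        time.sleep(10)
    return False
```

`parse_ready` is split out so the response handling can be checked without a live server; `wait_for_server()` blocks for up to ten minutes, comfortably above the ~3–4 minute load time.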
**Verify it's ready:**

```bash
curl http://localhost:8000/v1/models
```

**Test it:**

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMax-M2.5-REAP-172B-NVFP4",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,
    "min_p": 0.01,
    "max_tokens": 512
  }'
```

### What the env vars do

| Variable | Why |
|----------|-----|
| `VLLM_NVFP4_GEMM_BACKEND=marlin` | Use Marlin kernels for FP4 GEMM (FlashInfer JIT fails on Spark) |
| `VLLM_TEST_FORCE_FP8_MARLIN=1` | Required for Marlin backend activation |
| `VLLM_USE_FLASHINFER_MOE_FP4=0` | Disable FlashInfer for MoE FP4 (crashes with JIT ninja build) |
| `VLLM_MARLIN_USE_ATOMIC_ADD=1` | Atomic adds for Marlin (stability on GB10) |
| `GPU_MEMORY_UTIL=0.93` | 0.95 OOMs on Spark; 0.93 is the safe max |
| `--kv-cache-dtype fp8` | FP8 KV cache saves memory, enables ~127K token capacity |
| `--attention-backend flashinfer` | FlashInfer for attention (not MoE) — works fine |

### Notes

- `gpu_memory_utilization=0.95` will OOM. Use 0.93.
- The model serves on port 8000 with an OpenAI-compatible API.
- Prefix caching is enabled by default in vLLM 0.16+.
- With 65K context, you get roughly 2 concurrent sessions with large system prompts.
- Tool calling requires `--enable-auto-tool-choice --tool-call-parser minimax_m2`.

### Recommended Sampling Parameters

Per [MiniMax documentation](https://huggingface.co/MiniMaxAI/MiniMax-M2.5):

```json
{
  "temperature": 1.0,
  "top_p": 0.95,
  "top_k": 40,
  "min_p": 0.01
}
```

## Acknowledgments

- Base model by [MiniMax](https://huggingface.co/MiniMaxAI)
- REAP sparse-inference pruning by [Cerebras](https://huggingface.co/cerebras) ([paper](https://arxiv.org/abs/2510.13999))
- Quantization tooling by [vLLM / LLM Compressor](https://github.com/vllm-project/llm-compressor)
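## Python Client Example

For completeness, the recommended sampling parameters can be wrapped in a small standard-library client against the OpenAI-compatible endpoint. A sketch, assuming the server from the launch command above is up on `localhost:8000` and serves the model under the name used in the curl test:

```python
import json
import urllib.request

# Recommended sampling parameters per the MiniMax documentation.
SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 40, "min_p": 0.01}


def build_chat_request(messages, model="MiniMax-M2.5-REAP-172B-NVFP4",
                       max_tokens=512):
    """Assemble a /v1/chat/completions payload with the recommended sampling."""
    return {"model": model, "messages": messages,
            "max_tokens": max_tokens, **SAMPLING}


def chat(messages, url="http://localhost:8000/v1/chat/completions"):
    """POST the payload to the local vLLM server and return the reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_chat_request(messages)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        out = json.load(resp)
    return out["choices"][0]["message"]["content"]
```

Example call: `chat([{"role": "user", "content": "Hello!"}])`. Splitting out `build_chat_request` keeps the sampling defaults in one place if you swap in the `openai` client library instead.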