---
base_model: moonshotai/Kimi-K2.5
language:
- en
license: other
license_name: kimi-k2.5-license
license_link: https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/LICENSE
tags:
- quantization
- compressed-tensors
- gsq
- 2-bit
- moe
- multimodal
---

# Kimi-K2.5 — 2-bit GSQ Quantization

This is a **simulated 2-bit** quantized version of [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5), produced using **GSQ**, a learned post-training quantization method. The model weights are stored in [compressed-tensors](https://github.com/neuralmagic/compressed-tensors) format and are compatible with vLLM for inference.

> **Note — Simulated quantization:** The quantization was optimized at 2-bit precision, but the resulting weights are serialized into a 4-bit packed integer format (`int32` with 8 values per element) via compressed-tensors. At inference time, vLLM loads and dequantizes from this 4-bit container. The weight values themselves use only 4 distinct levels (matching true 2-bit), but the on-disk and in-memory representation is 4-bit — there is no memory or storage saving beyond INT4 in this checkpoint.
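To make the packing concrete, here is a minimal NumPy sketch of how eight 4-bit codes occupy one `int32` slot. The bit layout shown is illustrative and need not match compressed-tensors' exact `pack-quantized` implementation:

```python
import numpy as np

# Pack 8 four-bit codes into one 32-bit word (illustrative layout:
# code i occupies bits [4*i, 4*i + 4)).
def pack_int4(codes: np.ndarray) -> np.ndarray:
    codes = codes.astype(np.uint32).reshape(-1, 8)
    packed = np.zeros(codes.shape[0], dtype=np.uint32)
    for i in range(8):
        packed |= codes[:, i] << np.uint32(4 * i)
    return packed

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    codes = [(packed >> np.uint32(4 * i)) & np.uint32(0xF) for i in range(8)]
    return np.stack(codes, axis=1).reshape(-1)

# A "2-bit" weight only ever takes one of 4 code values (0..3),
# but each code still occupies a full 4-bit slot in the container:
codes = np.array([0, 1, 2, 3, 3, 2, 1, 0])
packed = pack_int4(codes)
print(packed.nbytes, "bytes for", codes.size, "codes")  # 4 bytes for 8 codes
assert np.array_equal(unpack_int4(packed), codes)
```

This is why the checkpoint saves nothing beyond INT4: the container always spends 4 bits per value, regardless of how few of the 16 possible codes the weights actually use.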
## Model Details

| Property | Value |
|---|---|
| Base model | [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5) |
| Architecture | MoE multimodal LLM (DeepSeek-V3-style MoE) |
| Transformer layers | 61 |
| Routed experts | 384 (8 active per token) |
| Hidden size | 7168 |
| Context length | 262,144 tokens (256K) |
| Total parameters | ~547B |
| Quantization | 2-bit GSQ (stored as INT4-packed via compressed-tensors) |
| Quantized layers | Expert FFN weights, layers 1–60 |
| Group size | 128 |
| Calibration dataset | [open-thoughts/OpenThoughts-114k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) |
| Weight format | compressed-tensors, `pack-quantized` |
| Disk size | ~511 GB |

## Results

### Benchmark Results (lm-evaluation-harness)

| Benchmark | Metric | Baseline (BF16) | GSQ 2-bit | Δ |
|---|---|---|---|---|
| GSM8K | exact_match (strict) | 94.01 | 92.57 | -1.44 |
| ARC-Challenge | acc_norm | 70.14 | 62.97 | -7.17 |
| ARC-Easy | acc_norm | 88.80 | 85.10 | -3.70 |
| PIQA | acc_norm | 86.29 | 82.37 | -3.92 |
| WinoGrande | acc | 80.82 | 76.95 | -3.87 |

### Perplexity (WikiText-2)

Evaluated on a 128-sample held-out split during quantization, measured every 6 layers as quantization progressed:

| Checkpoint | WikiText-2 PPL |
|---|---|
| Dense baseline | 1.734 |
| After layer 6 | 1.734 |
| After layer 12 | 1.733 |
| After layer 24 | 1.733 |
| After layer 36 | 1.735 |
| After layer 48 | 1.741 |
| After layer 60 (final) | **1.749** |

The final 2-bit quantized model retains perplexity within 0.015 of the dense baseline (< 1% relative degradation).

## Quantization Details

This model was quantized using **GSQ**, a learned post-training quantization method. Quantization was applied independently to each transformer layer using 4,096 calibration samples of sequence length 4,096 from the [OpenThoughts](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) dataset, with group size 128.
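To illustrate the grouping, here is a minimal NumPy sketch of group-wise 2-bit quantization with group size 128: each group of 128 weights shares one scale and offset, and every weight is rounded to one of 4 levels. It uses plain min-max rounding for clarity; GSQ itself *learns* the quantization against calibration data, which is not shown here:

```python
import numpy as np

GROUP_SIZE = 128
LEVELS = 4  # 2 bits -> 4 representable values per weight

def quantize_2bit(w: np.ndarray):
    """Min-max quantize each group of 128 weights to a 4-level grid."""
    groups = w.reshape(-1, GROUP_SIZE)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = (hi - lo) / (LEVELS - 1)          # one scale per group
    q = np.clip(np.round((groups - lo) / scale), 0, LEVELS - 1)
    return q.astype(np.uint8), scale, lo

def dequantize(q, scale, lo):
    return (q * scale + lo).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)   # 8 groups of 128
q, scale, lo = quantize_2bit(w)
w_hat = dequantize(q, scale, lo)
print("distinct codes:", np.unique(q))         # at most 4 values
print("max abs error:", np.abs(w - w_hat).max())
```

The rounding error of each weight is bounded by half its group's scale, which is why smaller groups (here, 128) reduce quantization error at the cost of storing more scales.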
Only the MoE expert feed-forward weights (`gate_proj`, `up_proj`, `down_proj`) in layers 1–60 are quantized. The following components are kept in original precision:

- Attention projections (`self_attn`)
- Embeddings and the LM head
- Layer norms
- The shared expert
- Layer 0's dense MLP
- All vision tower and multimodal projector weights

## Usage

This model requires **vLLM** for inference. Because Kimi-K2.5 uses a custom model architecture (`kimi_k25`), you must pass `--trust-remote-code`.

While the MoE expert weights are quantized to 2-bit, the attention, embedding, and norm weights remain in bfloat16, so the on-disk size is ~511 GB and the model still requires substantial GPU memory. In our testing, **8× NVIDIA GH200 96 GB GPUs** (2 nodes with tensor parallelism 8) are needed for serving.

### Installation

```bash
pip install vllm
```

### Serving with vLLM

```bash
vllm serve daslab-testing/Kimi-K2.5-2bit-GSQ \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --distributed-executor-backend ray \
  --tokenizer-mode hf \
  --mm-encoder-tp-mode data \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 4
```

**Flag notes:**

- `--tokenizer-mode hf`: Required to prevent garbled output during extended serving sessions (vLLM issue [#35718](https://github.com/vllm-project/vllm/issues/35718)).
- `--mm-encoder-tp-mode data`: Required for Kimi-K2.5's vision encoder — its ViT dimensions are not evenly divisible by the tensor-parallel size, which causes cuBLAS errors without this flag.
- `--max-model-len 4096`: Adjust upward if GPU memory permits; 4096 is what was used during our testing.
- `--distributed-executor-backend ray`: Required for multi-node serving.
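### Querying the server

Once running, the server exposes vLLM's OpenAI-compatible API. A minimal standard-library sketch, assuming the default port 8000:

```python
import json
import urllib.request

# Chat request against vLLM's OpenAI-compatible endpoint.
# Adjust the host/port if you changed the server defaults.
URL = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": "daslab-testing/Kimi-K2.5-2bit-GSQ",
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "temperature": 0.6,
    "max_tokens": 256,
}

def ask() -> str:
    req = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# print(ask())  # requires the server above to be running
```

Any OpenAI-compatible client (e.g. the `openai` Python package pointed at `http://localhost:8000/v1`) works the same way.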
### Offline inference with vLLM

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="daslab-testing/Kimi-K2.5-2bit-GSQ",
    trust_remote_code=True,
    tensor_parallel_size=8,
    tokenizer_mode="hf",
    mm_encoder_tp_mode="data",
    max_model_len=4096,
    gpu_memory_utilization=0.85,
)

sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)
outputs = llm.generate(["Explain the concept of entropy in thermodynamics."], sampling_params)
print(outputs[0].outputs[0].text)
```

### Chat template

Kimi-K2.5 uses its own tokenizer and chat template. Use the tokenizer bundled with this repository:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "daslab-testing/Kimi-K2.5-2bit-GSQ",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "What is 2+2?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```

## Limitations

- This is a research quantization, not a production-ready release. Expect some quality degradation relative to the full-precision model, particularly on tasks requiring precise arithmetic or complex multi-step reasoning.
- Vision/multimodal capabilities have not been evaluated post-quantization (only the language-model weights were quantized).
- The model uses a custom architecture; inference frameworks other than vLLM may not support it without modification.

## License

This model is derived from [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5) and is subject to the same [license terms](https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/LICENSE). Please review those terms before use.