Upload README.md with huggingface_hub
# Kimi-K2.5 – 2-bit GSQ Quantization

This is a **simulated 2-bit** quantized version of [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5), produced using **GSQ**, a learned post-training quantization method. The model weights are stored in [compressed-tensors](https://github.com/neuralmagic/compressed-tensors) format and are compatible with vLLM for inference.

> **Note – Simulated quantization:** The quantization was optimized at 2-bit precision during training, but the resulting weights are serialized into a 4-bit packed integer format (`int32` with 8 values per element) via compressed-tensors. At inference time, vLLM loads and dequantizes from this 4-bit container. The weight values themselves use only 4 distinct levels (matching true 2-bit), but the on-disk and in-memory representation is 4-bit, so this checkpoint offers no memory or storage saving beyond INT4.
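The packed layout described in the note can be sketched with a small, hypothetical example. The helper names and the nibble ordering below are illustrative, not the actual compressed-tensors implementation; the point is that values drawn from only 4 levels still occupy 4-bit slots, 8 per `int32` word:

```python
import numpy as np

def pack_int4(values: np.ndarray) -> np.ndarray:
    """Pack small non-negative ints (< 16) into int32 words,
    8 values per word, 4 bits each (illustrative layout)."""
    assert values.size % 8 == 0
    v = values.astype(np.uint32).reshape(-1, 8)
    shifts = np.arange(8, dtype=np.uint32) * 4
    return (v << shifts).sum(axis=1, dtype=np.uint32).view(np.int32)

def unpack_int4(packed: np.ndarray, n: int) -> np.ndarray:
    """Recover the first n 4-bit values from the packed int32 words."""
    w = packed.view(np.uint32)[:, None]
    shifts = np.arange(8, dtype=np.uint32) * 4
    return ((w >> shifts) & 0xF).reshape(-1)[:n].astype(np.int64)

rng = np.random.default_rng(0)
q = rng.integers(0, 4, size=16)   # 2-bit codes: only 4 levels ever used
packed = pack_int4(q)             # 16 values -> 2 int32 words (4 bits each)
assert np.array_equal(unpack_int4(packed, 16), q)
```

Even though `q` never uses levels 4–15, each value still consumes a full 4-bit slot, which is why the disk footprint matches an INT4 checkpoint.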
## Model Details

| Property | Value |
|---|---|
| Weight format | compressed-tensors, `pack-quantized` |
| Disk size | ~511 GB |
## Quantization Details

This model was quantized using **GSQ**, a learned post-training quantization method. Quantization was applied independently to each transformer layer using 4,096 calibration samples of sequence length 4,096 from the [OpenThoughts](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) dataset, with group size 128. GSQ proceeds in two stages:

1. **GPTQ initialization** – A Hessian-based second-order quantizer provides an initial quantized solution and per-group scales.
2. **Gumbel-Softmax refinement** – Sign and mask logits for the quantized representation are optimized through a differentiable relaxation using the Gumbel-Softmax estimator, with temperature and logit-scale annealing. Optimization uses the Lion optimizer.

Only the MoE expert feed-forward weights (`gate_proj`, `up_proj`, `down_proj`) in layers 1–60 are quantized. The following components are kept in original precision:

- Attention projections (`self_attn`)
- Embeddings and the LM head
- Layer norms
- The shared expert
- Layer 0's dense MLP
- All vision tower and multimodal projector weights
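The Gumbel-Softmax relaxation in stage 2 can be illustrated with a minimal numpy sketch (this is not the actual GSQ code; the 4-level codebook and the logit shapes are assumed for illustration). Each weight holds per-level logits; adding Gumbel noise and applying a temperature-scaled softmax yields a differentiable soft selection that hardens into a discrete code as the temperature anneals from 2.0 toward 0.05:

```python
import numpy as np

def gumbel_softmax(logits: np.ndarray, tau: float, rng) -> np.ndarray:
    """Relaxed categorical sample: softmax((logits + Gumbel noise) / tau)."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
    y = (logits + g) / tau
    y -= y.max(axis=-1, keepdims=True)                    # numerical stability
    p = np.exp(y)
    return p / p.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
levels = np.array([-1.5, -0.5, 0.5, 1.5])  # assumed 4-level 2-bit codebook
logits = rng.normal(size=(6, 4))           # per-weight level logits (toy size)

# Anneal the temperature: high tau -> soft mixture, low tau -> near one-hot.
for tau in (2.0, 0.5, 0.05):
    probs = gumbel_softmax(logits, tau, np.random.default_rng(1))
    w_soft = probs @ levels                # relaxed (differentiable) weight

# After training, the hard choice is the argmax over the learned logits:
hard = levels[np.argmax(logits, axis=-1)]
```

Because `w_soft` is a smooth function of the logits, gradients can flow through the level selection during refinement, which is what makes the discrete quantization choice trainable.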
### Quantization Configuration

| Parameter | Value |
|---|---|
| GPTQ init samples | 512 |
| Optimizer | Lion |
| Gumbel temperature schedule | 2.0 → 0.05 |
| Logit scale schedule | 100 → 500 |
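A simplified sketch of group-wise 2-bit quantization with group size 128 (assumed details: one scale per group and the signed 4-level code set {-2, -1, 0, 1}; the real GSQ pipeline learns these codes rather than rounding to them):

```python
import numpy as np

def quantize_2bit_groups(w: np.ndarray, group_size: int = 128):
    """Round each group of `group_size` weights to one of 4 signed
    integer levels {-2, -1, 0, 1}, sharing a single scale per group."""
    g = w.reshape(-1, group_size)
    scale = np.abs(g).max(axis=1, keepdims=True) / 2 + 1e-12
    q = np.clip(np.round(g / scale), -2, 1).astype(np.int8)
    return q, scale

def dequantize_groups(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct approximate weights from codes and per-group scales."""
    return (q.astype(np.float32) * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=512).astype(np.float32)
q, scale = quantize_2bit_groups(w)   # 512 weights -> 4 groups of 128
w_hat = dequantize_groups(q, scale)
```

With group size 128, each group stores only its 2-bit codes plus one scale, so the per-weight overhead of the scale is 1/128 of a full-precision value.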
## Results

### Benchmark Results (lm-evaluation-harness)

| Benchmark | Metric | Baseline (BF16) | GSQ 2-bit | Δ |
|---|---|---|---|---|
| GSM8K | exact_match (strict) | 94.01 | 92.57 | -1.44 |
| ARC-Challenge | acc_norm | 70.14 | 62.97 | -7.17 |
| ARC-Easy | acc_norm | 88.80 | 85.10 | -3.70 |
| PIQA | acc_norm | 86.29 | 82.37 | -3.92 |
| WinoGrande | acc | 80.82 | 76.95 | -3.87 |

### Perplexity (WikiText-2)

Evaluated on a 128-sample held-out split during quantization, measured every 6 layers as quantization progressed.

The final 2-bit quantized model retains perplexity within 0.015 of the dense baseline (< 1% relative degradation).
## Usage