Upload README.md with huggingface_hub
# Kimi-K2.5 – 2-bit GSQ Quantization

This is a **simulated 2-bit** quantized version of [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5), produced using **GSQ**, a learned post-training quantization method. The model weights are stored in [compressed-tensors](https://github.com/neuralmagic/compressed-tensors) format and are compatible with vLLM for inference.

> **Note – Simulated quantization:** The quantization was optimized at 2-bit precision during training, but the resulting weights are serialized into a 4-bit packed integer format (`int32` with 8 values per element) via compressed-tensors. At inference time, vLLM loads and dequantizes from this 4-bit container. The weight values themselves use only 4 distinct levels (matching true 2-bit), but the on-disk and in-memory representation is 4-bit, so this checkpoint offers no memory or storage saving beyond INT4.
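The packed layout described in the note can be sketched with a small, hypothetical example. The helper names and the nibble ordering below are illustrative, not the actual compressed-tensors implementation; the point is that values drawn from only 4 levels still occupy 4-bit slots, 8 per `int32` word:

```python
import numpy as np

def pack_int4(values: np.ndarray) -> np.ndarray:
    """Pack small non-negative ints (< 16) into int32 words,
    8 values per word, 4 bits each (illustrative layout)."""
    assert values.size % 8 == 0
    v = values.astype(np.uint32).reshape(-1, 8)
    shifts = np.arange(8, dtype=np.uint32) * 4
    return (v << shifts).sum(axis=1, dtype=np.uint32).view(np.int32)

def unpack_int4(packed: np.ndarray, n: int) -> np.ndarray:
    """Recover the first n 4-bit values from the packed int32 words."""
    w = packed.view(np.uint32)[:, None]
    shifts = np.arange(8, dtype=np.uint32) * 4
    return ((w >> shifts) & 0xF).reshape(-1)[:n].astype(np.int64)

rng = np.random.default_rng(0)
q = rng.integers(0, 4, size=16)   # 2-bit codes: only 4 levels ever used
packed = pack_int4(q)             # 16 values -> 2 int32 words (4 bits each)
assert np.array_equal(unpack_int4(packed, 16), q)
```

Even though `q` never uses levels 4–15, each value still consumes a full 4-bit slot, which is why the disk footprint matches an INT4 checkpoint.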
## Model Details

| Property | Value |
|---|---|
| Weight format | compressed-tensors, `pack-quantized` |
| Disk size | ~511 GB |
## Quantization Details

This model was quantized using **GSQ**, a learned post-training quantization method. Quantization was applied independently to each transformer layer using 4,096 calibration samples of sequence length 4,096 from the [OpenThoughts](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) dataset, with group size 128. GSQ proceeds in two stages:

1. **GPTQ initialization** – A Hessian-based second-order quantizer provides an initial quantized solution and per-group scales.
2. **Gumbel-Softmax refinement** – Sign and mask logits for the quantized representation are optimized through a differentiable relaxation using the Gumbel-Softmax estimator, with temperature and logit-scale annealing. Optimization uses the Lion optimizer.

Only the MoE expert feed-forward weights (`gate_proj`, `up_proj`, `down_proj`) in layers 1–60 are quantized. The following components are kept in original precision:

- Attention projections (`self_attn`)
- Embeddings and the LM head
- Layer norms
- The shared expert
- Layer 0's dense MLP
- All vision tower and multimodal projector weights
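The Gumbel-Softmax relaxation in stage 2 can be illustrated with a minimal numpy sketch (this is not the actual GSQ code; the 4-level codebook and the logit shapes are assumed for illustration). Each weight holds per-level logits; adding Gumbel noise and applying a temperature-scaled softmax yields a differentiable soft selection that hardens into a discrete code as the temperature anneals from 2.0 toward 0.05:

```python
import numpy as np

def gumbel_softmax(logits: np.ndarray, tau: float, rng) -> np.ndarray:
    """Relaxed categorical sample: softmax((logits + Gumbel noise) / tau)."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
    y = (logits + g) / tau
    y -= y.max(axis=-1, keepdims=True)                    # numerical stability
    p = np.exp(y)
    return p / p.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
levels = np.array([-1.5, -0.5, 0.5, 1.5])  # assumed 4-level 2-bit codebook
logits = rng.normal(size=(6, 4))           # per-weight level logits (toy size)

# Anneal the temperature: high tau -> soft mixture, low tau -> near one-hot.
for tau in (2.0, 0.5, 0.05):
    probs = gumbel_softmax(logits, tau, np.random.default_rng(1))
    w_soft = probs @ levels                # relaxed (differentiable) weight

# After training, the hard choice is the argmax over the learned logits:
hard = levels[np.argmax(logits, axis=-1)]
```

Because `w_soft` is a smooth function of the logits, gradients can flow through the level selection during refinement, which is what makes the discrete quantization choice trainable.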
### Quantization Configuration

| Parameter | Value |
|---|---|
| GPTQ init samples | 512 |
| Optimizer | Lion |
| Gumbel temperature schedule | 2.0 → 0.05 |
| Logit scale schedule | 100 → 500 |
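A simplified sketch of group-wise 2-bit quantization with group size 128 (assumed details: one scale per group and the signed 4-level code set {-2, -1, 0, 1}; the real GSQ pipeline learns these codes rather than rounding to them):

```python
import numpy as np

def quantize_2bit_groups(w: np.ndarray, group_size: int = 128):
    """Round each group of `group_size` weights to one of 4 signed
    integer levels {-2, -1, 0, 1}, sharing a single scale per group."""
    g = w.reshape(-1, group_size)
    scale = np.abs(g).max(axis=1, keepdims=True) / 2 + 1e-12
    q = np.clip(np.round(g / scale), -2, 1).astype(np.int8)
    return q, scale

def dequantize_groups(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct approximate weights from codes and per-group scales."""
    return (q.astype(np.float32) * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=512).astype(np.float32)
q, scale = quantize_2bit_groups(w)   # 512 weights -> 4 groups of 128
w_hat = dequantize_groups(q, scale)
```

With group size 128, each group stores only its 2-bit codes plus one scale, so the per-weight overhead of the scale is 1/128 of a full-precision value.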
## Results

### Benchmark Results (lm-evaluation-harness)

| Benchmark | Metric | Baseline (BF16) | GSQ 2-bit | Δ |
|---|---|---|---|---|
| GSM8K | exact_match (strict) | 94.01 | 92.57 | -1.44 |
| ARC-Challenge | acc_norm | 70.14 | 62.97 | -7.17 |
| ARC-Easy | acc_norm | 88.80 | 85.10 | -3.70 |
| PIQA | acc_norm | 86.29 | 82.37 | -3.92 |
| WinoGrande | acc | 80.82 | 76.95 | -3.87 |

### Perplexity (WikiText-2)

Evaluated on a 128-sample held-out split during quantization, measured every 6 layers as quantization progressed.

The final 2-bit quantized model retains perplexity within 0.015 of the dense baseline (< 1% relative degradation).
## Usage