soroushtabesh committed · verified
Commit 8fc2e19 · 1 Parent(s): 7df75db

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +22 -31

README.md CHANGED
@@ -16,9 +16,9 @@ tags:
 
 # Kimi-K2.5 — 2-bit GSQ Quantization
 
-This is a **simulated 2-bit** quantized version of [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5), produced using **GSQ** (Gumbel Softmax Quantization), a learned post-training quantization method. The model weights are stored in [compressed-tensors](https://github.com/neuralmagic/compressed-tensors) format and are compatible with vLLM for inference.
+This is a **simulated 2-bit** quantized version of [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5), produced using **GSQ**, a learned post-training quantization method. The model weights are stored in [compressed-tensors](https://github.com/neuralmagic/compressed-tensors) format and are compatible with vLLM for inference.
 
-> **Note — Simulated quantization:** GSQ optimizes quantized weight values at 2-bit precision during training, but the resulting weights are serialized into a 4-bit packed integer format (`int32` with 8 values per element) via compressed-tensors. At inference time, vLLM loads and dequantizes from this 4-bit container. The weight values themselves only use 4 distinct levels (matching true 2-bit), but the on-disk and in-memory representation is 4-bit — there is no memory or storage saving beyond INT4 in this checkpoint.
+> **Note — Simulated quantization:** The quantization was optimized at 2-bit precision during training, but the resulting weights are serialized into a 4-bit packed integer format (`int32` with 8 values per element) via compressed-tensors. At inference time, vLLM loads and dequantizes from this 4-bit container. The weight values themselves only use 4 distinct levels (matching true 2-bit), but the on-disk and in-memory representation is 4-bit — there is no memory or storage saving beyond INT4 in this checkpoint.
 
 ## Model Details
 
@@ -38,30 +38,19 @@ This is a **simulated 2-bit** quantized version of [moonshotai/Kimi-K2.5](https:
 | Weight format | compressed-tensors, `pack-quantized` |
 | Disk size | ~511 GB |
 
-## Quantization Method: GSQ
+## Results
 
-**GSQ** (Gumbel Softmax Quantization) is a differentiable learned post-training quantization method. It quantizes weight matrices to a low-bit discrete codebook using a two-stage pipeline applied independently to each transformer layer:
-
-1. **GPTQ initialization** — A Hessian-based second-order quantizer provides an initial quantized solution and per-group scales.
-2. **Gumbel-Softmax refinement** — Sign and mask logits for the quantized representation are optimized through a differentiable relaxation using the Gumbel-Softmax estimator, with temperature and logit-scale annealing. Optimization uses the Lion optimizer.
-
-For Kimi-K2.5, only the MoE expert feed-forward weights (`gate_proj`, `up_proj`, `down_proj`) in layers 1–60 are quantized. Attention projections, embeddings, layer norms, the shared expert, layer 0's dense MLP, and all vision components are kept in full bfloat16 precision.
-
-### Quantization Configuration
+### Benchmark Results (lm-evaluation-harness)
 
-| Hyperparameter | Value |
-|---|---|
-| Bits | 2 |
-| Group size | 128 |
-| Calibration samples | 4,096 |
-| Calibration sequence length | 4,096 |
-| Training epochs per layer | 10 |
-| GPTQ init samples | 512 |
-| Optimizer | Lion |
-| Gumbel temperature schedule | 2.0 → 0.05 |
-| Logit scale schedule | 100 → 500 |
+| Benchmark | Metric | Baseline (BF16) | GSQ 2-bit | Δ |
+|---|---|---|---|---|
+| GSM8K | exact_match (strict) | 94.01 | 92.57 | -1.44 |
+| ARC-Challenge | acc_norm | 70.14 | 62.97 | -7.17 |
+| ARC-Easy | acc_norm | 88.80 | 85.10 | -3.70 |
+| PIQA | acc_norm | 86.29 | 82.37 | -3.92 |
+| WinoGrande | acc | 80.82 | 76.95 | -3.87 |
 
-### Perplexity Results (WikiText-2)
+### Perplexity (WikiText-2)
 
 Evaluated on a 128-sample held-out split during quantization, measured every 6 layers as quantization progressed:
 
@@ -77,15 +66,17 @@ Evaluated on a 128-sample held-out split during quantization, measured every 6 l
 
 The final 2-bit quantized model retains perplexity within 0.015 of the dense baseline (< 1% relative degradation).
 
-### Benchmark Results (lm-evaluation-harness)
-
-| Benchmark | Metric | Score |
-|---|---|---|
-| GSM8K | exact_match (strict) | **92.57** |
-| ARC-Challenge | acc_norm | **62.97** |
-| ARC-Easy | acc_norm | **85.10** |
-| PIQA | acc_norm | **82.37** |
-| WinoGrande | acc | **76.95** |
+## Quantization Details
+
+This model was quantized using **GSQ**, a learned post-training quantization method. Quantization was applied independently to each transformer layer using 4,096 calibration samples of sequence length 4,096 from the [OpenThoughts](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) dataset, with group size 128.
 
+Only the MoE expert feed-forward weights (`gate_proj`, `up_proj`, `down_proj`) in layers 1–60 are quantized. The following components are kept in original precision:
+- Attention projections (`self_attn`)
+- Embeddings and the LM head
+- Layer norms
+- The shared expert
+- Layer 0's dense MLP
+- All vision tower and multimodal projector weights
 
 ## Usage
 
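The "simulated quantization" note above says 2-bit-valued weights are stored as 4-bit nibbles packed eight per `int32` word. A minimal pure-Python sketch (not the actual compressed-tensors packing code; the little-endian nibble order here is an assumption) shows why the container costs 4 bits per value no matter how few levels the codes actually use:

```python
import random

random.seed(0)
# Quantized codes that use only 4 distinct levels (true 2-bit) but are
# represented as 4-bit nibbles, values 0..15 possible, 0..3 actually used.
codes = [random.randrange(4) for _ in range(64)]

# Pack eight 4-bit nibbles into each 32-bit word (assumed little-endian order).
packed = [0] * (len(codes) // 8)
for i, c in enumerate(codes):
    packed[i // 8] |= c << (4 * (i % 8))

# Unpack and verify the round trip.
unpacked = [(packed[i // 8] >> (4 * (i % 8))) & 0xF for i in range(len(codes))]
assert unpacked == codes

# Storage cost: 4 bits/value regardless of how many levels are in use.
bits_per_value = 32 * len(packed) / len(codes)
print(bits_per_value)  # 4.0
```

Because the container is fixed at 4 bits per nibble, restricting codes to 4 levels changes nothing about disk or memory footprint, which is exactly the caveat the model card makes.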
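The refinement stage named in the diff relies on the Gumbel-Softmax relaxation: a soft, differentiable sample over codebook levels that hardens toward a one-hot choice as the temperature anneals (2.0 → 0.05 in the removed config table). A small NumPy sketch of the relaxation itself, with hypothetical codebook levels and logits that are not from GSQ:

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    """Draw a soft 'one-hot-ish' sample over levels; softer at high tau."""
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel) / tau
    y = np.exp(y - y.max(axis=-1, keepdims=True))  # stable softmax
    return y / y.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
levels = np.array([-1.5, -0.5, 0.5, 1.5])   # hypothetical 2-bit codebook
logits = np.array([[0.0, 2.0, 0.5, -1.0]])  # learned preference per level

soft = gumbel_softmax(logits, tau=2.0, rng=rng)   # early training: soft mix
hard = gumbel_softmax(logits, tau=0.05, rng=rng)  # after annealing: near one-hot

# The "quantized" weight is the probability-weighted mix of codebook levels,
# so gradients flow to the logits; at low tau it snaps toward a single level.
w_soft = float(soft @ levels)
w_hard = float(hard @ levels)
print(w_soft, w_hard)
```

During training the temperature is lowered step by step, so the soft mixture gradually commits to discrete codebook entries while remaining differentiable throughout.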