ubergarm committed
Commit 5d375ca · 0 parents

initial commit

Files changed (2):
  1. .gitattributes +38 -0
  2. README.md +191 -0

.gitattributes ADDED
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
imatrix-*.dat filter=lfs diff=lfs merge=lfs -text
*.gguf filter=lfs diff=lfs merge=lfs -text
*.png filter=lfs diff=lfs merge=lfs -text

README.md ADDED
---
quantized_by: ubergarm
pipeline_tag: text-generation
base_model: zai-org/GLM-5.1
base_model_relation: quantized
license: mit
tags:
- imatrix
- conversational
- glm_moe_dsa
- ik_llama.cpp
language:
- en
- zh
---

## WIP
- [x] download original bf16 safetensors
- [x] `convert_hf_to_gguf.py` using mainline llama.cpp
- [x] quantize `--pure` q8_0 and confirm it looks similar enough to the existing GLM-5 model architecture
- [ ] run ik_llama.cpp llama-imatrix against the full bf16 to get a high quality imatrix
- [ ] upload the imatrix so other people can begin quantizing with it as desired
- [ ] quantize/test/release `smol-IQ1_KT` and `smol-IQ2_KS` re-using the previous GLM-5 recipe
- [ ] experiment with jukofyork's patch to see if any low ~4 BPW quants align with the QAT and give better PPL/KLD
- [ ] potentially release some larger quants this time

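The convert/baseline/imatrix steps in that checklist can be sketched roughly as a pipeline. This is a dry-run sketch, not the exact commands used: the paths, directory layout, and calibration file name are all assumptions.

```shell
# Rough sketch of the convert -> q8_0 baseline -> imatrix pipeline.
# Dry-run by default (set DRY_RUN= to actually execute); all paths
# and directory names below are assumptions, adjust to your setup.
DRY_RUN=${DRY_RUN-1}
run() { if [ -n "$DRY_RUN" ]; then echo "+ $*"; else "$@"; fi; }

# 1. convert the original safetensors to a bf16 GGUF with mainline llama.cpp
run python llama.cpp/convert_hf_to_gguf.py --outtype bf16 \
    --outfile GLM-5.1-BF16.gguf ./GLM-5.1/

# 2. make a --pure q8_0 baseline to sanity check the architecture
run ./ik_llama.cpp/build/bin/llama-quantize --pure \
    GLM-5.1-BF16.gguf GLM-5.1-Q8_0.gguf Q8_0

# 3. compute an imatrix from the full bf16 against a calibration corpus
run ./ik_llama.cpp/build/bin/llama-imatrix -m GLM-5.1-BF16.gguf \
    -f calibration.txt -o imatrix-GLM-5.1.dat
```

With `DRY_RUN` unset, `run` executes the commands instead of printing them.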
## `ik_llama.cpp` imatrix Quantizations of zai-org/GLM-5.1
*NOTE* `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc., if you want to try it out before downloading my quants.

Some of ik's newer quant types are supported by the [Nexesenex/croco.cpp](https://github.com/Nexesenex/croco.cpp) fork of KoboldCPP, which has Windows builds. Also check out the [ik_llama.cpp Windows builds by Thireus](https://github.com/Thireus/ik_llama.cpp/releases).

These quants provide best-in-class perplexity for the given memory footprint.

## Big Thanks
Shout out to Wendell and the **Level1Techs** crew, the community [Forums](https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-home/225826), and [YouTube Channel](https://www.youtube.com/@Level1Techs)! **BIG thanks** for providing **BIG hardware** expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on the [BeaverAI Club Discord](https://huggingface.co/BeaverAI) and on [r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/) for tips and tricks helping each other run, test, and benchmark all the fun new models! Thanks to Hugging Face for hosting all these big quants!

Finally, I *really* appreciate the support from [aifoundry.org](https://aifoundry.org), so check out their open source RISC-V based solutions!

## Quant Collection
Perplexity computed against *wiki.test.raw* (lower is "better").

![Perplexity Chart](images/perplexity.png "Chart showing Perplexity vs Model Size.")

These two are just test quants for baseline perplexity comparison and are not available for download here:
* `BF16` TODO
  - PPL TODO
* `Q8_0` TODO
  - PPL TODO

*NOTE*: The first split file is much smaller on purpose since it only contains metadata; it's fine!

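Since later parts of a multi-file download are easy to miss, a small helper like this can confirm every split is present before you start the server. This is just a sketch assuming llama.cpp's standard `-00001-of-00004.gguf` split naming; the function name is mine.

```shell
# Hypothetical helper: verify every part of a split GGUF exists locally.
# Assumes the "-00001-of-00004.gguf" style naming used by llama.cpp splits.
check_splits() {
    local first=$1                  # path to the *-00001-of-*.gguf file
    local total=${first##*-of-}     # e.g. "00004.gguf"
    total=${total%.gguf}            # e.g. "00004"
    local i part
    for i in $(seq -f '%05g' 1 "$((10#$total))"); do
        part=${first/-00001-of-/-${i}-of-}
        [ -f "$part" ] || { echo "missing: $part"; return 1; }
    done
    echo "all $total parts present"
}
```

Point it at the first split file of whichever quant you downloaded.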
## IQ3_KS TODO
TODO

NOTE: Actual RAM/VRAM used will be about 314.07 GiB despite the larger reported model size, due to the unused blk.78/indexer/nextn tensors.

<details>

<summary>👈 Secret Recipe</summary>

```bash
TODO
```

</details>

## IQ2_KL TODO
TODO

NOTE: Actual RAM/VRAM used will be about 255.84 GiB despite the larger reported model size, due to the unused blk.78/indexer/nextn tensors.

<details>

<summary>👈 Secret Recipe</summary>

```bash
TODO
```

</details>

## smol-IQ2_KS TODO
TODO

NOTE: Actual RAM/VRAM used will be about 200 GiB despite the larger reported model size, due to the unused blk.78/indexer/nextn tensors.

<details>

<summary>👈 Secret Recipe</summary>

```bash
TODO
```

</details>

## smol-IQ1_KT TODO
TODO

NOTE: Actual RAM/VRAM used will be about 163.046 GiB despite the larger reported model size, due to the unused blk.78/indexer/nextn tensors.

<details>

<summary>👈 Secret Recipe</summary>

```bash
TODO
```

</details>

## Quick Start

```bash
# Clone and checkout
$ git clone https://github.com/ikawrakow/ik_llama.cpp
$ cd ik_llama.cpp

# Build for hybrid CPU+CUDA
$ cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON
$ cmake --build build --config Release -j $(nproc)

# Download Quants
$ pip install huggingface_hub
$ hf download --local-dir ./GLM-5.1-GGUF/ --include=smol-IQ2_KS/*.gguf ubergarm/GLM-5.1-GGUF

# Hybrid CPU and Single GPU
# *NOTE* -fit might work on ik_llama.cpp now so give it a try
./build/bin/llama-server \
    --model "$model" \
    --alias ubergarm/GLM-5.1 \
    -muge \
    --merge-qkv \
    --ctx-size 131072 \
    -ctk f16 \
    -mla 3 \
    -amb 512 \
    -ngl 999 \
    --n-cpu-moe 50 \
    --parallel 1 \
    --threads 96 \
    --threads-batch 128 \
    --host 127.0.0.1 \
    --port 8080 \
    --no-mmap \
    --jinja

# CPU-Only
numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-server \
    --model "$model" \
    --alias ubergarm/GLM-5.1 \
    -muge \
    --merge-qkv \
    --ctx-size 131072 \
    -ctk f16 \
    -mla 3 \
    --parallel 1 \
    --threads 96 \
    --threads-batch 128 \
    --numa numactl \
    --host 127.0.0.1 \
    --port 8080 \
    --no-mmap \
    --jinja
```

You can also bring your own chat template with `--chat-template-file myTemplate.jinja`.

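Once the server is up it speaks an OpenAI-compatible HTTP API, so a quick smoke test looks something like the following. This is a sketch: the endpoint path and request fields follow common llama.cpp server conventions, and the prompt is just an example.

```shell
# Minimal smoke test against a running llama-server (assumed on 127.0.0.1:8080).
payload='{"model":"ubergarm/GLM-5.1","messages":[{"role":"user","content":"Say hello."}],"max_tokens":32}'
curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d "$payload" || echo "server not reachable"
```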
## QAT Speculation
Assuming GLM-5.1 uses similar training to GLM-5, including INT4 QAT, there may be some tweaks to the quantization algorithm that match that target better.

> #### 2.4.3 INT4 Quantization-aware training
> To provide better accuracy at low-precision, we apply INT4 QAT in the SFT stage. Moreover, to further mitigate the training time overhead, we have developed a quantization kernel applicable to both training and offline weight quantization, which ensures bitwise-identical behavior between training and inference.
> https://arxiv.org/html/2602.15763v2

jukofyork posted some useful links with details, plus an experimental modified `q4_K` quantization implementation patch:

* https://github.com/zai-org/GLM-5/issues/21
* https://github.com/ywhhh/vllm-ascend-afd/blob/main/vllm_ascend/quantization/w4a8_dynamic.py
* https://github.com/ggml-org/llama.cpp/pull/17064#issuecomment-3528891329
* https://github.com/ggml-org/llama.cpp/pull/19460#issuecomment-4200617220

I may try that patch to `quantize_row_q4_0_ref()`, changing `const float d = max / -8;` to `-7`, similar to how we did Kimi-K2's `Q4_X` quantization type without imatrix on the routed experts. Or maybe `iq4_kss` would work well here. I'll release some smaller quants first, then play around with this a bit if I have time.

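If I do try it, that one-line change could be wrapped up like this. A sketch only: the file you pass in is whichever one contains `quantize_row_q4_0_ref()` in your checkout, so verify before patching, and the helper name is mine.

```shell
# Hypothetical helper: shift the q4_0 reference scale from max/-8 to max/-7
# so the grid matches a symmetric INT4 [-7, 7] target. Pass the path to the
# file containing quantize_row_q4_0_ref() in your checkout, then rebuild.
patch_q4_0_scale() {
    sed -i 's|const float d = max / -8;|const float d = max / -7;|' "$1"
}
```

For example, `patch_q4_0_scale ggml/src/ggml-quants.c` against a mainline checkout (verify the path first), then rebuild as usual.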
## References
* [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp)
* [Getting Started Guide (already out of date lol)](https://github.com/ikawrakow/ik_llama.cpp/discussions/258)
* [ubergarm-imatrix-calibration-corpus-v02.txt](https://gist.github.com/ubergarm/edfeb3ff9c6ec8b49e88cdf627b0711a?permalink_comment_id=5682584#gistcomment-5682584)