Important Note
For the IQ_KS quants, DO NOT use mainline llama.cpp, Ollama, or anything that uses the mainline llama.cpp backend; use ik_llama.cpp instead.
The Q_K_M quants are fine, though.
Still uploading BTW!
Quantized using ik_llama.cpp commit 6ea7f32.
Calibration data by Bartowski, thank you legends!
Perplexity tested on Wikitext-2 test.raw:
- BF16 - Final estimate: PPL over 72 chunks for n_ctx=4096 = 6.2671 +/- 0.04039
- Q6_K - Final estimate: PPL over 72 chunks for n_ctx=4096 = 6.2376 +/- 0.04001
- Q5_K_M - Final estimate: PPL over 72 chunks for n_ctx=4096 = 6.2564 +/- 0.04021
- Q4_K_M - Final estimate: PPL over 72 chunks for n_ctx=4096 = 6.2901 +/- 0.04049
- IQ4_KS - Final estimate: PPL over 72 chunks for n_ctx=4096 = 6.2921 +/- 0.04055
- Q3_K_M - Final estimate: PPL over 72 chunks for n_ctx=4096 = 6.4269 +/- 0.04165
- IQ3_KS - Final estimate: PPL over 72 chunks for n_ctx=4096 = 6.4566 +/- 0.04177
Holy shit!! Why is it so low though?! That's practically lossless!! (Might've gotten butchered on creative tasks, tho.)
Note: this quant is not coherent (perhaps usable as a draft model? Or maybe it could work with a proper system prompt? Haven't tried instruct mode either):
- IQ2_XS - Final estimate: PPL over 72 chunks for n_ctx=4096 = 7.3814 +/- 0.04912 (even with a custom quant recipe)
Dunno what's going on: somehow the Q5_K_M perplexity is lower than BF16. Need to investigate, though the gap (0.0107) is well inside the +/- 0.04 error bars, so it may just be measurement noise.
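A quick sanity check on that: comparing each quant's PPL delta against BF16 to the combined standard error of the two measurements (numbers copied from the table above; this is a rough error-bar overlap check, not rigorous statistics):

```python
# PPL results from the table above: (ppl, stderr) per quant.
results = {
    "BF16":   (6.2671, 0.04039),
    "Q6_K":   (6.2376, 0.04001),
    "Q5_K_M": (6.2564, 0.04021),
    "Q4_K_M": (6.2901, 0.04049),
    "IQ4_KS": (6.2921, 0.04055),
    "Q3_K_M": (6.4269, 0.04165),
    "IQ3_KS": (6.4566, 0.04177),
}

bf16_ppl, bf16_err = results["BF16"]
for name, (ppl, err) in results.items():
    if name == "BF16":
        continue
    delta = ppl - bf16_ppl
    # Standard error of the difference: errors add in quadrature.
    combined = (bf16_err**2 + err**2) ** 0.5
    verdict = "significant" if abs(delta) > combined else "within noise"
    print(f"{name}: delta={delta:+.4f}, combined_err={combined:.4f} -> {verdict}")
```

By this check, everything down to IQ4_KS (including the "better than BF16" Q5_K_M and Q6_K results) lands within the noise, while the 3-bit quants show a real degradation.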
Okay, so I think... pruning noise via imatrix calibration makes the model less uncertain when picking tokens, which means... yes, the model is more deterministic, perhaps less creative? Unconfirmed, but maybe more focused?? I have no idea!
So in theory you could make your own custom calibration data that works almost like a LoRA, except instead of adding data you're only keeping what's more aligned with your goals and discarding the rest.
Maybe I'm wrong, perhaps it's related to QAT or something... no idea!