Unreleased

#6
by jpsequeira - opened

Hey,

Could you explain why your perplexity chart includes quants with the unreleased label?
IQ4_NL seems to be the best one there, but it's unreleased. Is there any way to get hold of that quant?

Thanks

> Could you explain why your perplexity chart includes quants with the unreleased label?

I made some test quants and used them for benchmarking relative performance. Even though they aren't released, they still provide useful relative quality comparisons.

> IQ4_NL seems to be the best one there, but it's unreleased. Is there any way to get hold of that quant?

I didn't upload the larger models to save space on my public repo quota, and in general I focus on releasing ik_llama.cpp-exclusive quants.

Check out: https://huggingface.co/AesSedai/GLM-5-GGUF as AesSedai uses similar style recipes as me focusing on mainline MoEs.

Can you remember if you used the imatrix for your IQ4_NL version?

I've just been testing a custom version of Q4_K that should work better with the QAT they used, and got:

Final estimate: PPL over 565 chunks for n_ctx=512 = 2.6677 +/- 0.01420

(no imatrix used)

@jukofyork


Looking at my logs, yes I used the imatrix in this repo with my IQ4_NL.

Here are my perplexity logs for that run:

```
$ grep -E '(Final|model size)' perplexity-GLM-5-smol-IQ4_NL.log
llm_load_print_meta: model size       = 405.502 GiB (4.621 BPW)
Final estimate: PPL over 565 chunks for n_ctx=512 = 2.6730 +/- 0.01422
```
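As a quick sanity check on that log line (my own arithmetic, not something stated in the thread): the reported file size and bits-per-weight together imply the total weight count.

```shell
# Back-of-envelope check (my own arithmetic, not from the thread): the
# logged "model size = 405.502 GiB (4.621 BPW)" implies the total number
# of quantized weights stored in the file.
params=$(awk 'BEGIN {
  gib  = 405.502                    # model size from the log, in GiB
  bpw  = 4.621                      # bits per weight from the log
  bits = gib * 1024 * 1024 * 1024 * 8
  printf "%.1f", bits / bpw / 1e9   # billions of weights
}')
echo "~${params}B weights implied"
# prints: ~753.8B weights implied
```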

Here is the exact recipe I used for that one:

```bash
#!/usr/bin/env bash

custom="
# 79 Repeating Layers [0-78]

## Attention [0-78]
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0

# First 3 Dense Layers [0-2]
blk\..*\.ffn_down\.weight=iq6_k
blk\..*\.ffn_(gate|up)\.weight=iq6_k

# Shared Expert Layers [3-78]
blk\..*\.ffn_down_shexp\.weight=iq6_k
blk\..*\.ffn_(gate|up)_shexp\.weight=iq6_k

# Routed Experts Layers [3-78]
# NOTE: blk.78.* NOT implemented at time of quantizing so no imatrix data available
blk\.(78)\.ffn_down_exps\.weight=iq6_k
blk\.(78)\.ffn_(gate|up)_exps\.weight=iq6_k
blk\..*\.ffn_down_exps\.weight=iq4_nl
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_nl

# Lightning indexer tensors [0-78]
# NOTE: indexer.* NOT implemented at time of quantizing so no imatrix data available
blk\..*\.indexer\.proj\.weight=q8_0
blk\..*\.indexer\.attn_k\.weight=q8_0
blk\..*\.indexer\.attn_q_b\.weight=q8_0

# NextN MTP Layer [78]
# NOTE: nextn.* NOT implemented at time of quantizing so no imatrix data available
blk\..*\.nextn\.eh_proj\.weight=q8_0

# Non-Repeating Layers
token_embd\.weight=iq6_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/GLM-5-GGUF/imatrix-GLM-5-BF16.dat \
    /mnt/data/models/ubergarm/GLM-5-GGUF/GLM-256x22B-5-BF16-00001-of-00033.gguf \
    /mnt/data/models/ubergarm/GLM-5-GGUF/GLM-5-smol-IQ4_NL.gguf \
    IQ4_NL \
    128
```
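To illustrate the grep/sed step in that recipe (with a couple of illustrative tensor patterns, not the full list): it strips comment and blank lines, then joins the remaining regex=type pairs into the single comma-separated string that llama-quantize's `--custom-q` flag expects.

```shell
# Sketch of the recipe's grep|sed pipeline (GNU sed assumed for -z).
# Comment lines and blank lines are dropped, then runs of newlines are
# collapsed into commas and the leading/trailing comma is trimmed.
# The tensor patterns here are just examples.
custom="
# a comment line
blk\..*\.attn_output\.weight=q8_0

blk\..*\.ffn_down_exps\.weight=iq4_nl
"
custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
echo "$custom"
# prints: blk\..*\.attn_output\.weight=q8_0,blk\..*\.ffn_down_exps\.weight=iq4_nl
```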

Thanks, so maybe my custom Q4_K code targeted at their QAT is working then.
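For context (my own back-of-envelope reading, not a claim from the thread): the gap between the two runs is well inside either run's quoted error band, so on this test the two quants look statistically indistinguishable under a naive reading.

```shell
# Back-of-envelope comparison of the two PPL estimates quoted in this
# thread. This ignores that both runs share the same eval chunks (so the
# errors are correlated), but even naively the delta sits well inside
# the reported +/- band.
awk 'BEGIN {
  q4k = 2.6677   # custom Q4_K, no imatrix
  iq4 = 2.6730   # IQ4_NL, with imatrix
  err = 0.0142   # reported +/- for either run (approximately equal)
  printf "delta = %.4f (reported +/- %.4f)\n", iq4 - q4k, err
}'
# prints: delta = 0.0053 (reported +/- 0.0142)
```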

I kinda expected it to be closer to BF16 after Kimi-K2-Thinking, but realised we never actually got the "real" BF16 of that model and could only compare back with the INT4 model they gave us!

I'm away from home and it will take me several days to download GLM-5.1, but I posted the custom Q4_K code here:

https://github.com/ggml-org/llama.cpp/pull/19460#issuecomment-4200617220

GLM-5.1 just landed!!

Awaiting your magic 😀
