Quantization was performed using ExLlamaV3 v0.0.25.
| Quant | Size (GB) | Actual bpw | PPL | KL-div (q→o) | KL-div (o→q) | Top-1 | Top-2 | Top-3 | Top-4 | Top-5 |
|---|---|---|---|---|---|---|---|---|---|---|
| 3.0bpw | 142 | 3.00 | 3.220 | 0.0674 | 0.0776 | 91.9% | 68.4% | 44.5% | 26.6% | 14.8% |
| 3.5bpw_opt | 143 | 3.03 | 3.173 | 0.0474 | 0.0531 | 93.5% | 73.4% | 51.1% | 32.9% | 20.1% |
| 4.0bpw | 188 | 4.00 | 3.101 | 0.0203 | 0.0210 | 95.7% | 81.0% | 62.3% | 44.7% | 30.5% |
| 4.5bpw_opt | 189 | 4.03 | 3.082 | 0.0149 | 0.0153 | 96.3% | 83.9% | 67.2% | 50.7% | 36.6% |
| 5.0bpw | 234 | 5.00 | 3.067 | 0.0079 | 0.0079 | 97.3% | 87.6% | 73.9% | 59.0% | 45.3% |
| original | 751 | 16.00 | 3.053 | — | — | — | — | — | — | — |
## Metrics
- **PPL (perplexity)**: how well the model predicts the next token. Lower is better; the original model's PPL is the baseline.
- **KL-div (Kullback-Leibler divergence)**: how far the quant's next-token probability distribution is from the original's. Lower is better. Shown in both directions (quant→orig and orig→quant); asymmetry between the two indicates where the quant over- or under-estimates probabilities.
- **Top-K agreement**: the fraction of positions where the quant's top-K predicted tokens match the original's top-K. Higher is better. Top-1 matters most (does the quant pick the same best token?); higher K values show agreement across less likely candidates.
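To make the metrics concrete, here is a minimal sketch of how KL divergence and top-K agreement can be computed for a single token position, using toy distributions rather than real model logits (the actual evaluation averages these over a full test corpus):

```python
import math

def kl_div(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def top_k_agreement(p, q, k):
    """True if the top-k token indices of the two distributions match."""
    top = lambda d: set(sorted(range(len(d)), key=d.__getitem__, reverse=True)[:k])
    return top(p) == top(q)

# Toy next-token distributions over a 4-token vocabulary
orig  = [0.70, 0.20, 0.07, 0.03]
quant = [0.65, 0.25, 0.06, 0.04]

print(kl_div(quant, orig))              # KL-div (q->o)
print(kl_div(orig, quant))              # KL-div (o->q)
print(top_k_agreement(quant, orig, 1))  # both pick token 0 here
```

Note the two KL directions give slightly different values even on this toy pair, which is the asymmetry reported in the table.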
## Optimized quants
Variants marked `_opt` are built using ExLlamaV3's layer-wise optimization. Instead of quantizing all layers at the same bitrate, the optimizer measures per-layer sensitivity and allocates bits where they matter most: sensitive layers take higher-precision tensors from a higher-bpw quant, while less sensitive layers stay at lower precision.
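The idea can be illustrated with a toy greedy allocator. This is a hypothetical sketch, not ExLlamaV3's actual optimizer (which measures quantization error per layer directly); it only shows the principle of spending a fixed average-bpw budget on the most sensitive layers first:

```python
def allocate_bits(sensitivity, budget_bpw, low=3.0, high=4.0):
    """Toy layer-wise bit allocation (illustrative only).

    Start every layer at the low bitrate, then upgrade layers to the
    high bitrate in order of decreasing sensitivity, stopping when the
    next upgrade would push the average bpw past the budget.
    """
    n = len(sensitivity)
    bits = [low] * n
    for i in sorted(range(n), key=lambda i: -sensitivity[i]):
        if (sum(bits) + (high - low)) / n > budget_bpw:
            break
        bits[i] = high
    return bits

# Four layers; layers 2 and 1 are most sensitive, so they are upgraded
print(allocate_bits([0.1, 0.3, 0.9, 0.2], budget_bpw=3.5))
# -> [3.0, 4.0, 4.0, 3.0]
```

This also explains why an `_opt` variant's actual bpw can sit close to the lower bitrate while still recovering much of the higher quant's quality.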
## Tool Calls Support for Qwen/GLM Models
The official tabbyAPI doesn't support tool calls for Qwen and GLM models yet.
If you're using Pi Coding Agent, Qwen-Code, OpenClaw, or similar software that needs tool-call support, you can use the `tools-support` branch of my fork:
Clone directly:

```sh
git clone -b tools-support https://github.com/NeuroSenko/tabbyAPI
```
Or add it to an existing tabbyAPI installation:

```sh
git remote add neurosenko https://github.com/NeuroSenko/tabbyAPI
git fetch neurosenko
git checkout -b tools-support neurosenko/tools-support
```
This branch includes native tool calling support for Qwen and GLM model families.
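Once the branch is running, tool calls go through tabbyAPI's OpenAI-compatible chat completions endpoint using the standard `tools` field. The sketch below only builds the request payload; the model name and the `get_weather` tool are illustrative placeholders, not part of this repo:

```python
import json

# OpenAI-style chat completion request with a tool definition.
# The model name and tool are hypothetical examples; substitute your
# own loaded model and tool schema.
payload = {
    "model": "Qwen3.5-397B-A17B-exl3",
    "messages": [
        {"role": "user", "content": "What's the weather in Kyiv?"}
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool
            "description": "Return current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}
print(json.dumps(payload, indent=2))
```

POST this JSON to the server's `/v1/chat/completions` route; with the `tools-support` branch, Qwen and GLM models can respond with structured tool calls instead of plain text.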
## Model tree for NeuroSenko/Qwen3.5-397B-A17B-exl3

Base model: Qwen/Qwen3.5-397B-A17B