Quantization Method
Hey there,
Can you talk about your quantization method? I am curious to learn more about the mixed precision setup. Can you provide any code examples?
Thanks!
Of course! My earliest attempts "cloned" exact settings from Unsloth GGUFs (https://github.com/spicyneuron/gguf-clone), and I've also tried empirical analysis to identify sensitive weights (similar to https://github.com/baa-ai/MINT).
But I eventually found that uploading config.json and safetensors.index.json to Claude / Codex and discussing the model architecture arrives at 90%+ of the same recommendations.
The typical process is:
- Create a "maximized" 4-bit version where all potentially sensitive layers are BF16
- Incrementally drop precision (BF16 → 8-bit → 6-bit) for each trial
- Look for a clear best size / speed / quality tradeoff
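The sweep above can be sketched as a per-layer bit assignment. This is an illustrative Python sketch, not the actual mlx-lm API: the `SENSITIVE` set, `TRIAL_LADDER`, and `bits_for_layer` are all hypothetical names, and the sensitive-layer list is a placeholder for whatever the architecture discussion surfaces.

```python
# Hypothetical sketch of the precision sweep: start from a "maximized"
# config where sensitive layers stay BF16, then step them down per trial.

# Layers flagged as potentially sensitive (placeholder names; in practice
# these come from discussing the architecture). Everything else gets the
# 4-bit baseline.
SENSITIVE = {"model.embed_tokens", "model.layers.0.self_attn", "lm_head"}

# Precision ladder for sensitive layers across trials: BF16 -> 8 -> 6.
TRIAL_LADDER = [None, 8, 6]  # None means keep BF16 (unquantized)

def bits_for_layer(layer_name: str, trial: int, base_bits: int = 4):
    """Return the bit width for a layer in a given trial (None = BF16)."""
    if any(layer_name.startswith(p) for p in SENSITIVE):
        return TRIAL_LADDER[trial]
    return base_bits

# Trial 0: sensitive layers stay BF16, everything else 4-bit.
assert bits_for_layer("lm_head", trial=0) is None
assert bits_for_layer("model.layers.10.mlp", trial=0) == 4
# Trial 2: sensitive layers dropped to 6-bit.
assert bits_for_layer("lm_head", trial=2) == 6
```

A predicate shaped like this is also roughly what granular quantization overrides need: a mapping from layer name to precision, evaluated per trial.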
For evaluation, I run 500 tasks each from hellaswag, piqa, and winogrande, as well as perplexity and throughput tests:
mlx_lm.benchmark --model "$model" --prompt-tokens 1024 --generation-tokens 512 --num-trials 5
mlx_lm.perplexity --model "$model" --sequence-length 1024 --seed 123
mlx_lm.evaluate --model "$model" --task hellaswag --seed 123 --limit 500
mlx_lm.evaluate --model "$model" --task piqa --seed 123 --limit 500
mlx_lm.evaluate --model "$model" --task winogrande --seed 123 --limit 500
And depending on model size, I'll sometimes pick both a "best tradeoff" and a "smallest acceptable" version.
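For picking those two versions, one way to formalize it is a simple score over the measured numbers. This is a hypothetical helper (`pick_variants` and the 5% perplexity threshold are my own illustration, and the result rows are made-up data); in practice I just eyeball the size / speed / quality table.

```python
# Hypothetical helper for comparing quant variants once benchmark,
# perplexity, and eval numbers are in. The scoring rule is illustrative.

def pick_variants(results, max_ppl_increase=0.05):
    """results: list of dicts with keys name, size_gb, ppl, tok_per_s.
    Returns (best_tradeoff, smallest_acceptable)."""
    best_ppl = min(r["ppl"] for r in results)
    # "Acceptable" = perplexity within max_ppl_increase of the best variant.
    acceptable = [r for r in results
                  if r["ppl"] <= best_ppl * (1 + max_ppl_increase)]
    smallest = min(acceptable, key=lambda r: r["size_gb"])
    # "Best tradeoff" = highest throughput per GB among acceptable variants.
    best = max(acceptable, key=lambda r: r["tok_per_s"] / r["size_gb"])
    return best, smallest

# Made-up numbers for three trial quants of the same model.
results = [
    {"name": "4bit-max", "size_gb": 20.1, "ppl": 5.02, "tok_per_s": 38},
    {"name": "mixed-8",  "size_gb": 17.9, "ppl": 5.05, "tok_per_s": 42},
    {"name": "mixed-6",  "size_gb": 16.8, "ppl": 5.31, "tok_per_s": 44},
]

best, smallest = pick_variants(results)
# Here mixed-6's perplexity falls outside the 5% band, so mixed-8 wins
# both slots; with a looser threshold the two picks can diverge.
```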
Still awaiting maintainer review, but here's the mlx-lm PR for granular quantization overrides: https://github.com/ml-explore/mlx-lm/pull/922
Let me know if you end up trying this yourself!