Quantization Method

#1
by x-polyglot-x - opened

Hey there,

Can you talk about your quantization method? I am curious to learn more about the mixed precision setup. Can you provide any code examples?

Thanks!

Of course! My earliest attempts "cloned" exact settings from Unsloth GGUFs (https://github.com/spicyneuron/gguf-clone), and I've also tried empirical analysis to identify sensitive weights (similar to https://github.com/baa-ai/MINT).

But I eventually found that uploading config.json and safetensors.index.json to Claude / Codex and discussing the model architecture gets me to 90%+ of the same recommendations.

The typical process is:

  • Create a "maximized" 4-bit version where all potentially sensitive layers are kept in BF16
  • Drop precision one step at a time (BF16 → 8-bit → 6-bit) in each trial
  • Look for a clear best size / speed / quality tradeoff
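As a rough illustration of the first step, here's a per-layer precision policy in the spirit of mlx-lm's `quant_predicate` hook (the real hook also receives the module and config; it's simplified here to a path-only signature, and the layer-name patterns are my own illustrative assumptions, not the actual recipe):

```python
# Sketch of a per-layer precision policy. Returning False leaves a layer
# in BF16; returning a dict quantizes it with that spec. Layer names and
# thresholds are hypothetical examples.

SENSITIVE = ("embed_tokens", "lm_head", "norm")  # hypothetical "keep BF16" set

def quant_predicate(path, bits=4, group_size=64):
    """Return False to leave a layer unquantized (BF16), else a quant spec."""
    if any(name in path for name in SENSITIVE):
        return False  # stays BF16
    if "down_proj" in path:
        # Example mixed-precision override: extra bits for a projection
        # the architecture discussion flagged as sensitive.
        return {"bits": 8, "group_size": group_size}
    return {"bits": bits, "group_size": group_size}
```

Subsequent trials then tighten the overrides (8-bit → 6-bit, or dropping layers from the sensitive set) and re-run the evals.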

For evaluation, I run 500 examples each from hellaswag, piqa, and winogrande, plus perplexity and throughput tests:

```shell
mlx_lm.benchmark --model "$model" --prompt-tokens 1024 --generation-tokens 512 --num-trials 5
mlx_lm.perplexity --model "$model" --sequence-length 1024 --seed 123
mlx_lm.evaluate --model "$model" --task hellaswag --seed 123 --limit 500
mlx_lm.evaluate --model "$model" --task piqa --seed 123 --limit 500
mlx_lm.evaluate --model "$model" --task winogrande --seed 123 --limit 500
```
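To run that suite across every trial quantization, a small driver helps. A minimal sketch (the `trials/` directory layout is a hypothetical convention, not part of my setup):

```python
# Sketch: build and run the eval suite for each trial model via subprocess.
import shlex
import subprocess

SUITE = [
    "mlx_lm.benchmark --model {m} --prompt-tokens 1024 --generation-tokens 512 --num-trials 5",
    "mlx_lm.perplexity --model {m} --sequence-length 1024 --seed 123",
] + [
    f"mlx_lm.evaluate --model {{m}} --task {task} --seed 123 --limit 500"
    for task in ("hellaswag", "piqa", "winogrande")
]

def commands_for(model):
    """Expand the suite templates into argv lists for one model path."""
    return [shlex.split(cmd.format(m=model)) for cmd in SUITE]

def run_suite(model):
    for cmd in commands_for(model):
        subprocess.run(cmd, check=True)  # raise if any eval fails
```

Usage would be something like `for model in glob("trials/*"): run_suite(model)`, collecting the printed metrics per trial.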

And depending on model size, I'll sometimes pick both a "best tradeoff" and a "smallest acceptable" version.
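For illustration, that two-pick selection could be formalized with quality-retention thresholds (the 1% / 3% tolerances here are hypothetical, not my actual cutoffs):

```python
# Sketch: pick "best tradeoff" and "smallest acceptable" from trial results.
# Tolerances are illustrative assumptions.

def pick(trials, tol_best=0.01, tol_small=0.03):
    """trials: list of (name, size_gb, quality_score) tuples."""
    top = max(score for _, _, score in trials)

    def smallest_within(tol):
        # Smallest model whose quality is within `tol` of the best trial.
        ok = [t for t in trials if t[2] >= top * (1 - tol)]
        return min(ok, key=lambda t: t[1])

    return smallest_within(tol_best), smallest_within(tol_small)
```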

Still awaiting maintainer review, but here's the mlx-lm PR for granular quantization overrides: https://github.com/ml-explore/mlx-lm/pull/922

Let me know if you end up trying this yourself!
