Quantization Method

#1
by x-polyglot-x - opened

Hey there,

Can you talk about your quantization method? I am curious to learn more about the mixed precision setup. Can you provide any code examples?

Thanks!

Of course! My earliest attempts "cloned" exact settings from Unsloth GGUFs (https://github.com/spicyneuron/gguf-clone), and I've also tried empirical analysis to identify sensitive weights (similar to https://github.com/baa-ai/MINT).

But I eventually found that uploading config.json and safetensors.index.json to Claude / Codex and discussing the model architecture gets me to 90%+ of the same recommendations.

The typical process is:

  • Create a "maximized" 4-bit version where all potentially sensitive layers are kept in BF16
  • Drop precision one step at a time (BF16 → 8-bit → 6-bit) in each trial
  • Look for a clear best size / speed / quality tradeoff
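As a rough illustration of the first step, here's a per-layer precision policy in the spirit of mlx-lm's `quant_predicate` hook (the real hook also receives the module and config; it's simplified here to a path-only signature, and the layer-name patterns are my own illustrative assumptions, not the actual recipe):

```python
# Sketch of a per-layer precision policy. Returning False leaves a layer
# in BF16; returning a dict quantizes it with that spec. Layer names and
# thresholds are hypothetical examples.

SENSITIVE = ("embed_tokens", "lm_head", "norm")  # hypothetical "keep BF16" set

def quant_predicate(path, bits=4, group_size=64):
    """Return False to leave a layer unquantized (BF16), else a quant spec."""
    if any(name in path for name in SENSITIVE):
        return False  # stays BF16
    if "down_proj" in path:
        # Example mixed-precision override: extra bits for a projection
        # the architecture discussion flagged as sensitive.
        return {"bits": 8, "group_size": group_size}
    return {"bits": bits, "group_size": group_size}
```

Subsequent trials then tighten the overrides (8-bit → 6-bit, or dropping layers from the sensitive set) and re-run the evals.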

For evaluation, I run 500 examples each from hellaswag, piqa, and winogrande, plus perplexity and throughput tests:

```shell
mlx_lm.benchmark --model "$model" --prompt-tokens 1024 --generation-tokens 512 --num-trials 5
mlx_lm.perplexity --model "$model" --sequence-length 1024 --seed 123
mlx_lm.evaluate --model "$model" --task hellaswag --seed 123 --limit 500
mlx_lm.evaluate --model "$model" --task piqa --seed 123 --limit 500
mlx_lm.evaluate --model "$model" --task winogrande --seed 123 --limit 500
```
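To run that suite across every trial quantization, a small driver helps. A minimal sketch (the `trials/` directory layout is a hypothetical convention, not part of my setup):

```python
# Sketch: build and run the eval suite for each trial model via subprocess.
import shlex
import subprocess

SUITE = [
    "mlx_lm.benchmark --model {m} --prompt-tokens 1024 --generation-tokens 512 --num-trials 5",
    "mlx_lm.perplexity --model {m} --sequence-length 1024 --seed 123",
] + [
    f"mlx_lm.evaluate --model {{m}} --task {task} --seed 123 --limit 500"
    for task in ("hellaswag", "piqa", "winogrande")
]

def commands_for(model):
    """Expand the suite templates into argv lists for one model path."""
    return [shlex.split(cmd.format(m=model)) for cmd in SUITE]

def run_suite(model):
    for cmd in commands_for(model):
        subprocess.run(cmd, check=True)  # raise if any eval fails
```

Usage would be something like `for model in glob("trials/*"): run_suite(model)`, collecting the printed metrics per trial.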

And depending on model size, I'll sometimes pick both a "best tradeoff" and a "smallest acceptable" version.
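For illustration, that two-pick selection could be formalized with quality-retention thresholds (the 1% / 3% tolerances here are hypothetical, not my actual cutoffs):

```python
# Sketch: pick "best tradeoff" and "smallest acceptable" from trial results.
# Tolerances are illustrative assumptions.

def pick(trials, tol_best=0.01, tol_small=0.03):
    """trials: list of (name, size_gb, quality_score) tuples."""
    top = max(score for _, _, score in trials)

    def smallest_within(tol):
        # Smallest model whose quality is within `tol` of the best trial.
        ok = [t for t in trials if t[2] >= top * (1 - tol)]
        return min(ok, key=lambda t: t[1])

    return smallest_within(tol_best), smallest_within(tol_small)
```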

Still awaiting maintainer review, but here's the mlx-lm PR for granular quantization overrides: https://github.com/ml-explore/mlx-lm/pull/922

Let me know if you end up trying this yourself!
