# PolarQuant Models
Optimal Gaussian quantization via Hadamard rotation. Beats torchao INT4 on PPL. arXiv: 2603.7424577
Native vLLM loading. Marlin kernel. 168 tok/s on A100.
A PolarQuant Q5 intermediate produces better INT4 weights than direct quantization; the result is stored in CompressedTensors format for native vLLM inference.
| | Original | PolarQuant INT4 |
|---|---|---|
| Format | BF16 safetensors | CompressedTensors pack-quantized |
| vLLM | `vllm serve model` | `vllm serve model --language-model-only` |
| Speed | ~45 tok/s | 168 tok/s (Marlin kernel) |
| VRAM | ~54 GB | ~14 GB |
| Quality | baseline | cos_sim > 0.998 |
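The quality row reports cos_sim > 0.998 between the original and quantized model. As an illustrative sketch (not the project's actual evaluation code, which the card does not show), cosine similarity between two tensors, flattened to vectors, can be computed like this:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two tensors, flattened to 1-D vectors."""
    a = a.ravel().astype(np.float64)
    b = b.ravel().astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy check: a weight matrix vs. a slightly perturbed (e.g. dequantized) copy
rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
w_deq = w + rng.standard_normal(w.shape).astype(np.float32) * 1e-3
print(cosine_similarity(w, w_deq))  # very close to 1.0
```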
```bash
vllm serve caiovicentino1/Huihui-Qwopus3.5-27B-v3-abliterated-PolarQuant-Q5 \
  --language-model-only --max-model-len 4096 --trust-remote-code
```
That's it. No plugin, no pip install, no custom code: vLLM detects the CompressedTensors format automatically and uses the Marlin kernel.
To load through transformers instead, install the plugin first:

```bash
pip install polarquant
```

```python
import polarengine_vllm  # auto-registers PolarQuant with transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "caiovicentino1/Huihui-Qwopus3.5-27B-v3-abliterated-PolarQuant-Q5",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "caiovicentino1/Huihui-Qwopus3.5-27B-v3-abliterated-PolarQuant-Q5",
    trust_remote_code=True,
)

inputs = tokenizer("Hello!", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
Standard quantization (GPTQ, AWQ) quantizes weights directly — outliers cause large errors.
PolarQuant adds a preprocessing step that improves INT4 quality: a Hadamard rotation spreads outlier energy across the weight tensor, and Lloyd-Max quantization produces the Q5 intermediate from which the INT4 weights are derived.
The result: same INT4 speed, better quality than direct GPTQ/AWQ.
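The effect of the rotation can be shown with a toy NumPy sketch (illustrative only, not the actual PolarQuant pipeline): a single outlier forces a large absmax scale and coarse INT4 steps, while an orthonormal Hadamard rotation spreads the outlier's energy so the same INT4 grid fits the rotated weights far better.

```python
import numpy as np

def hadamard_matrix(n: int) -> np.ndarray:
    """Normalized Hadamard matrix for n a power of two (Sylvester construction)."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def absmax_int4(x: np.ndarray) -> np.ndarray:
    """Symmetric absmax INT4 quantize-dequantize round trip (codes in [-7, 7])."""
    scale = np.abs(x).max() / 7.0
    q = np.clip(np.round(x / scale), -7, 7)
    return q * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(256)
w[3] = 40.0                       # a single outlier inflates the absmax scale

h = hadamard_matrix(256)
w_rot = h @ w                     # rotation spreads the outlier's energy

err_direct = np.mean((absmax_int4(w) - w) ** 2)
# Quantize in the rotated basis, then rotate back (h is orthonormal: h.T = h^-1)
err_rotated = np.mean((h.T @ absmax_int4(w_rot) - w) ** 2)
print(err_direct, err_rotated)    # rotated error is much smaller
```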
| Method | PPL (lower = better) |
|---|---|
| BF16 baseline | 6.37 |
| PolarQuant Q5 → INT4 | 6.56 (+0.19) |
| Direct INT4 (absmax) | 6.68 (+0.31) |

| Spec | Value |
|---|---|
| Layers | 48 |
| KV Heads | 4 |
| Head Dim | 128 |
| Quantization | INT4 symmetric, group_size=128 (CompressedTensors) |
| Preprocessing | PolarQuant Q5 (Hadamard + Lloyd-Max) |
| Base Model | huihui-ai/Qwopus3.5-27B-v3-abliterated |
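The spec table lists INT4 symmetric quantization with group_size=128. A minimal sketch of what group-wise symmetric quantization means (an assumption-level illustration with per-group absmax scales, not the actual CompressedTensors packing code):

```python
import numpy as np

def quant_dequant_int4_grouped(w: np.ndarray, group_size: int = 128) -> np.ndarray:
    """Symmetric group-wise INT4 round trip: one absmax scale per group of
    `group_size` consecutive weights along the flattened last axis."""
    orig_shape = w.shape
    g = w.reshape(-1, group_size)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0                   # avoid div-by-zero on all-zero groups
    q = np.clip(np.round(g / scale), -8, 7)   # INT4 codes
    return (q * scale).reshape(orig_shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 512)).astype(np.float32)
w_deq = quant_dequant_int4_grouped(w)
rel_err = np.linalg.norm(w_deq - w) / np.linalg.norm(w)
print(rel_err)  # small relative error
```

Smaller groups give each scale less dynamic range to cover, which is why group_size=128 trades a little metadata overhead for noticeably lower quantization error than per-tensor scaling.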
To convert any PQ5 model to CompressedTensors:

```bash
pip install polarquant
polarquant export-ct model_id
```