Huihui-Qwopus3.5-27B-v3-abliterated — PolarQuant INT4

Native vLLM loading. Marlin kernel. 168 tok/s on A100.

A PolarQuant Q5 intermediate step produces better INT4 weights than direct quantization; the result is stored in CompressedTensors format for native vLLM inference.

|         | Original              | PolarQuant INT4                        |
|---------|-----------------------|----------------------------------------|
| Format  | BF16 safetensors      | CompressedTensors (pack-quantized)     |
| vLLM    | `vllm serve model`    | `vllm serve model --language-model-only` |
| Speed   | ~45 tok/s             | 168 tok/s (Marlin kernel)              |
| VRAM    | ~54 GB                | ~14 GB                                 |
| Quality | baseline              | cos_sim > 0.998                        |

Quick Start — vLLM (recommended)

```shell
vllm serve caiovicentino1/Huihui-Qwopus3.5-27B-v3-abliterated-PolarQuant-Q5 --language-model-only --max-model-len 4096 --trust-remote-code
```

That's it. No plugin, no pip install, no custom code. vLLM detects the CompressedTensors format automatically and uses the Marlin kernel.
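Once the server is up, you can query it through vLLM's OpenAI-compatible endpoint. A minimal stdlib-only sketch, assuming the default host and port (`localhost:8000`); adjust if you changed them:

```python
# Build and send a chat request to the OpenAI-compatible API vLLM exposes.
import json
import urllib.request

payload = {
    "model": "caiovicentino1/Huihui-Qwopus3.5-27B-v3-abliterated-PolarQuant-Q5",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=10) as r:
        print(json.loads(r.read())["choices"][0]["message"]["content"])
except OSError:
    print("vLLM server not reachable on localhost:8000")
```

The official `openai` Python client works the same way: point `base_url` at `http://localhost:8000/v1` and pass any string as the API key.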

Quick Start — HuggingFace Transformers

```shell
pip install polarquant
```

```python
import polarengine_vllm  # auto-registers PolarQuant with transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "caiovicentino1/Huihui-Qwopus3.5-27B-v3-abliterated-PolarQuant-Q5",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "caiovicentino1/Huihui-Qwopus3.5-27B-v3-abliterated-PolarQuant-Q5",
    trust_remote_code=True,
)

inputs = tokenizer("Hello!", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

How PolarQuant Works

Standard quantization methods (GPTQ, AWQ) quantize weights directly, so outliers force a coarse quantization grid and cause large errors.

PolarQuant adds a preprocessing step that improves INT4 quality:

  1. Hadamard rotation — distributes weight energy uniformly (no outliers)
  2. Lloyd-Max Q5 — MSE-optimal 32-level quantization for the resulting Gaussian distribution
  3. Dequant to BF16 — produces cleaner weights than the original
  4. INT4 CompressedTensors — standard format, Marlin kernel, native vLLM

The result: same INT4 speed, better quality than direct GPTQ/AWQ.
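The effect of the rotation step can be seen in a toy NumPy sketch. This is illustrative only, not the actual PolarQuant implementation: it uses plain absmax quantization in place of Lloyd-Max, and a Sylvester-construction Hadamard matrix.

```python
# Toy demo: Hadamard rotation spreads an outlier's energy across all
# coordinates, so absmax quantization wastes far less of its range.
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal

def absmax_quant(x, bits=4):
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / levels
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
n = 256
w = rng.standard_normal(n)
w[3] = 25.0                    # inject a single large outlier

H = hadamard(n)
w_rot = H @ w                  # rotated weights: no dominant coordinate

err_direct = np.linalg.norm(w - absmax_quant(w))
err_rotated = np.linalg.norm(w - H.T @ absmax_quant(w_rot))
print(err_direct, err_rotated)  # rotated error is much smaller
```

Because the Hadamard matrix is orthogonal, rotating back (`H.T @ ...`) is exact, so all of the improvement comes from quantizing a better-conditioned distribution.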

| Method                | PPL (lower = better) |
|-----------------------|----------------------|
| BF16 baseline         | 6.37                 |
| PolarQuant Q5 → INT4  | 6.56 (+0.19)         |
| Direct INT4 (absmax)  | 6.68 (+0.31)         |

Architecture

| Spec          | Value                                              |
|---------------|----------------------------------------------------|
| Layers        | 48                                                 |
| KV Heads      | 4                                                  |
| Head Dim      | 128                                                |
| Quantization  | INT4 symmetric, group_size=128 (CompressedTensors) |
| Preprocessing | PolarQuant Q5 (Hadamard + Lloyd-Max)               |
| Base Model    | huihui-ai/Qwopus3.5-27B-v3-abliterated             |
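These specs also give a back-of-envelope KV-cache budget, assuming a 16-bit KV cache:

```python
# KV cache per token = 2 (K and V) x layers x kv_heads x head_dim x bytes/elem
layers, kv_heads, head_dim = 48, 4, 128
bytes_per_elem = 2                      # 16-bit (BF16) cache assumed
per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
print(per_token / 1024)                 # KiB per token of context
ctx = 4096
print(per_token * ctx / 2**20)          # MiB for the full 4096-token window
```

At 96 KiB per token, the full 4096-token window costs 384 MiB on top of the ~14 GB of weights, which is why the model fits comfortably on a 24 GB card.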
