PolarQuant Models
Collection
Optimal Gaussian quantization via Hadamard rotation. Beats torchao INT4 on PPL. arXiv:2603.29078 • 25 items
PQ5+INT4 weights for consumer GPU inference.
30B-A3B MoE | MLA attention | MIT license | 22.2 tok/s
61 GB → 19 GB (-69%) | cos_sim >0.998 | 6,265 layers quantized
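The pipeline named in this collection (Hadamard rotation followed by Lloyd-Max quantization) can be illustrated with a minimal numpy sketch. This is a toy reconstruction, not the released implementation: the matrix size, level count, and iteration budget are illustrative, and the cosine-similarity check mirrors the metric reported above.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal: H @ H.T == I

def lloyd_max_1d(x, levels=32, iters=50):
    # 1-D Lloyd-Max: alternate nearest-centroid assignment
    # and centroid update (a 1-D k-means on the scalar values).
    c = np.quantile(x, np.linspace(0.01, 0.99, levels))
    for _ in range(iters):
        idx = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
        for k in range(levels):
            if np.any(idx == k):
                c[k] = x[idx == k].mean()
    return c, idx

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))            # toy weight matrix
H = hadamard(64)
Wr = H @ W                                   # rotate: entries become ~Gaussian
centers, codes = lloyd_max_1d(Wr.ravel(), levels=32)  # 2^5 levels -> 5-bit codes
Wq = H.T @ centers[codes].reshape(Wr.shape)  # dequantize, then un-rotate

cos = (W.ravel() @ Wq.ravel()) / (np.linalg.norm(W) * np.linalg.norm(Wq))
print(f"cos_sim = {cos:.4f}")
```

Because the Hadamard rotation is orthonormal, quantization error in the rotated basis maps back to the original weights unchanged in norm, which is why a 5-bit Gaussian-optimal codebook already yields cosine similarity very close to 1.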
| Component | Layers | Original | PQ5 Packed |
|---|---|---|---|
| nn.Linear (INT4) | 377 | ~8 GB | 1.8 GB |
| MoE Experts | 5,888 slices | ~50 GB | 15 GB |
| Norms/Embed | — | ~3 GB | 3 GB (kept) |
| Total | 6,265 | 61 GB | 19 GB (-69%) |

| Metric | Value |
|---|---|
| VRAM | 58.0 GB |
| Speed | 22.2 tok/s |
| Peak VRAM | 58.3 GB |
| Polar Codes | 19.0 GB (bit-packed) |
| Quantized | 377 linear + 5,888 experts |
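The "bit-packed" storage of the polar codes can be sketched in numpy: 5-bit codes (values 0..31) are packed eight-to-five into bytes rather than stored one per byte. The helper names here are hypothetical, but the arithmetic matches the table: 5 bits per weight versus 16-bit bf16 is a ratio of 5/16 ≈ 0.31, consistent with 61 GB → 19 GB.

```python
import numpy as np

def pack5(codes):
    # Pack 5-bit codes (0..31) into a contiguous byte buffer.
    bits = np.unpackbits(codes.astype(np.uint8)[:, None], axis=1)[:, 3:]  # keep low 5 bits
    return np.packbits(bits.ravel())

def unpack5(packed, n):
    # Recover n codes from the packed buffer.
    bits = np.unpackbits(packed)[: n * 5].reshape(n, 5)
    pad = np.zeros((n, 3), dtype=np.uint8)
    return np.packbits(np.concatenate([pad, bits], axis=1), axis=1).ravel()

codes = np.arange(32, dtype=np.uint8)
packed = pack5(codes)
assert np.array_equal(unpack5(packed, 32), codes)   # lossless round trip
print(len(packed), "bytes for 32 codes")            # 20 bytes = 5/8 of 32
```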

| GPU | VRAM | Status |
|---|---|---|
| A100 (80 GB) | 80 GB | Fits fully (58 GB used) |
| RTX PRO 6000 (96 GB) | 96 GB | Fits fully |
| A100 (40 GB) | 40 GB | Expert offloading needed |
| RTX 4090 (24 GB) | 24 GB | Expert offloading needed |
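For the smaller GPUs above, "expert offloading" generally means keeping MoE expert weights in host RAM and copying each expert to the GPU only when the router selects it. A minimal PyTorch sketch of that pattern (a generic illustration, not this repo's offloading code; `OffloadedExpert` is a hypothetical name):

```python
import torch
import torch.nn as nn

class OffloadedExpert(nn.Module):
    # Hypothetical wrapper: parameters live on CPU (pinned when CUDA is
    # available) and are copied to the compute device only for the
    # forward pass, then moved back to free VRAM.
    def __init__(self, expert: nn.Module):
        super().__init__()
        self.expert = expert.to("cpu")
        if torch.cuda.is_available():
            for p in self.expert.parameters():
                p.data = p.data.pin_memory()  # faster H2D copies

    def forward(self, x):
        self.expert.to(x.device, non_blocking=True)  # host -> device copy
        out = self.expert(x)
        self.expert.to("cpu")                        # release VRAM
        return out

# Toy usage (runs on CPU, so no GPU is required for the sketch):
e = OffloadedExpert(nn.Linear(8, 8))
y = e(torch.randn(2, 8))
print(y.shape)
```

The trade-off is the PCIe copy on every selected expert, which is why offloading configurations run well below the 22.2 tok/s reported for the fully resident model.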
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load PQ5 codes and dequant; a standard transformers load is shown
# as a sketch. See setup instructions at
# github.com/caiovicentino/polarengine-vllm
tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7-Flash")
model = AutoModelForCausalLM.from_pretrained(
    "zai-org/GLM-4.7-Flash", torch_dtype=torch.bfloat16, device_map="auto"
)

# Or use vLLM for serving:
#   vllm serve zai-org/GLM-4.7-Flash --tool-call-parser glm47
```
```bibtex
@article{polarquant2026,
  title={PolarQuant: Hadamard-Rotated Lloyd-Max Quantization},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.29078},
  year={2026}
}
```
61 GB → 19 GB with cos_sim >0.998. MIT license. Quantized with PolarQuant.
Base model
zai-org/GLM-4.7-Flash