GLM-4.7-Flash β€” PolarQuant Q5 (Bit-Packed)

PQ5+INT4 weights for consumer GPU inference.

30B-A3B MoE | MLA attention | MIT license | 22.2 tok/s

61 GB β†’ 19 GB (-69%) | cos_sim >0.998 | 6,265 layers quantized

Download Size

Download

Compression

Compression

Component Layers Original PQ5 Packed
nn.Linear (INT4) 377 ~8 GB 1.8 GB
MoE Experts 5,888 slices ~50 GB 15 GB
Norms/Embed β€” ~3 GB 3 GB (kept)
Total 6,265 61 GB 19 GB (-69%)

Benchmarks

Metric Value
VRAM 58.0 GB
Speed 22.2 tok/s
Peak VRAM 58.3 GB
Polar Codes 19.0 GB (bit-packed)
Quantized 377 linear + 5,888 experts

Quality Validation

  • TCP vs UDP: Correct, structured explanation with thinking
  • Sieve of Eratosthenes: Correct Python implementation
  • Aurora Borealis: Accurate physics explanation

Architecture

  • 47 layers, hidden=2048, 20 heads
  • MLA (Multi-head Latent Attention) β€” compressed KV via lora_rank=512
  • 64 MoE experts, 4 active per token, 1 shared expert
  • 131K context length
  • Speculative decoding support (MTP)
  • MIT license β€” fully open

Hardware

GPU VRAM Status
A100 (80 GB) 80 GB Fits fully (58 GB used)
RTX PRO 6000 (96 GB) 96 GB Fits fully
A100 (40 GB) 40 GB Expert offloading needed
RTX 4090 (24 GB) 24 GB Expert offloading needed

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load PQ5 codes and dequant
# See setup instructions at github.com/caiovicentino/polarengine-vllm

# Or use vLLM for serving:
# vllm serve zai-org/GLM-4.7-Flash --tool-call-parser glm47

Links

Citation

@article{polarquant2026,
  title={PolarQuant: Hadamard-Rotated Lloyd-Max Quantization},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.29078},
  year={2026}
}

61 GB β†’ 19 GB with cos_sim >0.998. MIT license. Quantized with PolarQuant.

Downloads last month
89
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for caiovicentino1/GLM-4.7-Flash-PolarQuant-Q5

Quantized
(77)
this model

Collection including caiovicentino1/GLM-4.7-Flash-PolarQuant-Q5

Paper for caiovicentino1/GLM-4.7-Flash-PolarQuant-Q5