Qwen3.5-27B-heretic-v3-autoround-w4a16

Quantized version of Qwen3.5-27B-heretic-v3 using Intel AutoRound (W4A16).

Quantization Details

  • Method: AutoRound (Weight-only INT4)
  • Precision: W4A16 (4-bit weights, 16-bit activations)
  • Framework: Intel Neural Compressor
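To make "W4A16" concrete, here is a minimal NumPy sketch of group-wise symmetric 4-bit weight quantization. It stores what this format stores — one INT4 value per weight plus one higher-precision scale per group — but uses naive round-to-nearest; AutoRound's actual contribution is *learning* the rounding offsets, which this sketch does not do. The group size of 128 is a common default, not confirmed for this checkpoint.

```python
import numpy as np

def quant_dequant_w4(w, group_size=128):
    """Quantize weights to symmetric INT4 per group, then dequantize.

    Sketch only: round-to-nearest, not AutoRound's learned rounding.
    """
    groups = w.reshape(-1, group_size)
    # One scale per group, mapping the group's max magnitude to 7 (INT4 max).
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    # Dequantize with fp16 scales, as a W4A16 kernel would at load time.
    return (q * scale.astype(np.float16)).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
w_hat = quant_dequant_w4(w)
err = float(np.abs(w - w_hat.astype(np.float32)).max())
```

The maximum reconstruction error per weight is about half a scale step, which is why 4-bit weight-only quantization preserves quality well while cutting weight memory by roughly 4x versus BF16.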

Performance

  • Context Length: 150k tokens
  • Speed: ~63 tokens/sec on 2x RTX 3090
  • KV cache capacity: 97,216 tokens

Quality Benchmarks

Test                 Result
Logic (widgets)      ✅ Correct
Math (derivatives)   ✅ Correct
Coding               ✅ Clean
Tricky reasoning     ✅ Pass

Usage

vLLM

python -m vllm.entrypoints.openai.api_server \
  --model ./Qwen3.5-27B-heretic-v3-autoround-w4a16 \
  --host 0.0.0.0 \
  --port 1234 \
  --dtype bfloat16 \
  --max-model-len 150000 \
  --quantization auto-round \
  --allow-deprecated-quantization \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.95
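Once the server is up, it speaks the OpenAI chat-completions API. A minimal request sketch using only the standard library (the host, port, and model path assume the launch command above; adjust if you changed them):

```python
import json

# Request body for the OpenAI-compatible /v1/chat/completions endpoint.
payload = {
    "model": "./Qwen3.5-27B-heretic-v3-autoround-w4a16",
    "messages": [
        {"role": "user", "content": "Explain W4A16 in one sentence."}
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}

# With the server running, send it like so:
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:1234/v1/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
print(json.dumps(payload, indent=2))
```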

Python

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "groxaxo/Qwen3.5-27B-heretic-v3-autoround-w4a16",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(
    "groxaxo/Qwen3.5-27B-heretic-v3-autoround-w4a16"
)

Hardware Requirements

  • Minimum: 2x GPU with 24GB VRAM each (for 150k context)
  • Recommended: 2x RTX 3090 / 4090 or equivalent
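A back-of-envelope estimate of why two 24 GB cards suffice, assuming ~27B parameters at 4 bits each and a rough 10% overhead for quantization scales and higher-precision layers (the KV cache for long contexts takes the remaining headroom):

```python
params = 27e9                       # ~27B parameters
weight_gb = params * 4 / 8 / 1e9    # 4-bit weights -> ~13.5 GB
total_gb = weight_gb * 1.10         # assumed ~10% overhead for scales etc.
per_gpu_gb = total_gb / 2           # split across tensor-parallel ranks
print(f"~{per_gpu_gb:.1f} GB of weights per GPU")
```

That leaves roughly 16 GB per GPU for the KV cache and activations at `--gpu-memory-utilization 0.95`, which is what makes the 150k context fit.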

Credits

  • Base model: Qwen Team
  • Quantization: Intel AutoRound
  • Fine-tuning: Heretic v3