Qwen3.5-27B-heretic-v3-autoround-w4a16

Quantized version of Qwen3.5-27B-heretic-v3 using Intel AutoRound (W4A16).

Quantization Details

  • Method: AutoRound (Weight-only INT4)
  • Precision: W4A16 (4-bit weights, 16-bit activations)
  • Framework: Intel Neural Compressor
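To make "W4A16" concrete, here is a minimal NumPy sketch of group-wise symmetric 4-bit weight quantization. It stores what this format stores — one INT4 value per weight plus one higher-precision scale per group — but uses naive round-to-nearest; AutoRound's actual contribution is *learning* the rounding offsets, which this sketch does not do. The group size of 128 is a common default, not confirmed for this checkpoint.

```python
import numpy as np

def quant_dequant_w4(w, group_size=128):
    """Quantize weights to symmetric INT4 per group, then dequantize.

    Sketch only: round-to-nearest, not AutoRound's learned rounding.
    """
    groups = w.reshape(-1, group_size)
    # One scale per group, mapping the group's max magnitude to 7 (INT4 max).
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    # Dequantize with fp16 scales, as a W4A16 kernel would at load time.
    return (q * scale.astype(np.float16)).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
w_hat = quant_dequant_w4(w)
err = float(np.abs(w - w_hat.astype(np.float32)).max())
```

The maximum reconstruction error per weight is about half a scale step, which is why 4-bit weight-only quantization preserves quality well while cutting weight memory by roughly 4x versus BF16.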

Performance

  • Context Length: 150k tokens
  • Speed: ~63 tokens/sec on 2x RTX 3090
  • KV cache capacity: 97,216 tokens

Quality Benchmarks

Test                 Result
Logic (widgets)      ✅ Correct
Math (derivatives)   ✅ Correct
Coding               ✅ Clean
Tricky reasoning     ✅ Pass

Usage

vLLM

python -m vllm.entrypoints.openai.api_server \
  --model ./Qwen3.5-27B-heretic-v3-autoround-w4a16 \
  --host 0.0.0.0 \
  --port 1234 \
  --dtype bfloat16 \
  --max-model-len 150000 \
  --quantization auto-round \
  --allow-deprecated-quantization \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.95
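Once the server is up, it speaks the OpenAI chat-completions API. A minimal request sketch using only the standard library (the host, port, and model path assume the launch command above; adjust if you changed them):

```python
import json

# Request body for the OpenAI-compatible /v1/chat/completions endpoint.
payload = {
    "model": "./Qwen3.5-27B-heretic-v3-autoround-w4a16",
    "messages": [
        {"role": "user", "content": "Explain W4A16 in one sentence."}
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}

# With the server running, send it like so:
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:1234/v1/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
print(json.dumps(payload, indent=2))
```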

Python

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "groxaxo/Qwen3.5-27B-heretic-v3-autoround-w4a16",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(
    "groxaxo/Qwen3.5-27B-heretic-v3-autoround-w4a16"
)

Hardware Requirements

  • Minimum: 2x GPU with 24GB VRAM each (for 150k context)
  • Recommended: 2x RTX 3090 / 4090 or equivalent
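A back-of-envelope estimate of why two 24 GB cards suffice, assuming ~27B parameters at 4 bits each and a rough 10% overhead for quantization scales and higher-precision layers (the KV cache for long contexts takes the remaining headroom):

```python
params = 27e9                       # ~27B parameters
weight_gb = params * 4 / 8 / 1e9    # 4-bit weights -> ~13.5 GB
total_gb = weight_gb * 1.10         # assumed ~10% overhead for scales etc.
per_gpu_gb = total_gb / 2           # split across tensor-parallel ranks
print(f"~{per_gpu_gb:.1f} GB of weights per GPU")
```

That leaves roughly 16 GB per GPU for the KV cache and activations at `--gpu-memory-utilization 0.95`, which is what makes the 150k context fit.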

Credits

  • Base model: Qwen Team
  • Quantization: Intel AutoRound
  • Fine-tuning: Heretic v3