# Qwen3.5-27B-heretic-v3-autoround-w4a16

Quantized version of Qwen3.5-27B-heretic-v3 using Intel AutoRound (W4A16).
## Quantization Details
- Method: AutoRound (Weight-only INT4)
- Precision: W4A16 (4-bit weights, 16-bit activations)
- Framework: Intel Neural Compressor
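As a back-of-envelope sanity check (illustrative arithmetic, not a measured checkpoint size), 4-bit weights put the raw weight storage for a ~27B-parameter model at roughly 12.6 GiB:

```python
# Rough W4A16 weight footprint (illustrative only; real checkpoints
# add per-group scales/zeros and packing overhead on top of this).
params = 27e9          # ~27B parameters
bits_per_weight = 4    # INT4 weights; activations stay 16-bit
weight_gib = params * bits_per_weight / 8 / 2**30
print(f"~{weight_gib:.1f} GiB of INT4 weights")  # → ~12.6 GiB
```

This is why the model fits comfortably in aggregate 48 GB of VRAM with room left for the KV cache.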
## Performance
- Context Length: 150k tokens
- Speed: ~63 tokens/sec on 2x RTX 3090
- KV Cache: 97,216 tokens
## Quality Benchmarks
| Test | Result |
|---|---|
| Logic (widgets) | ✅ Correct |
| Math (derivatives) | ✅ Correct |
| Coding | ✅ Clean |
| Tricky reasoning | ✅ Pass |
## Usage
### vLLM

```bash
python -m vllm.entrypoints.openai.api_server \
  --model ./Qwen3.5-27B-heretic-v3-autoround-w4a16 \
  --host 0.0.0.0 \
  --port 1234 \
  --dtype bfloat16 \
  --max-model-len 150000 \
  --quantization auto-round \
  --allow-deprecated-quantization \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.95
```
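The server exposes an OpenAI-compatible chat API. A minimal stdlib-only request sketch (endpoint and port assume the command above; the `model` field must match the served `--model` path):

```python
import json
from urllib import request

# Build a chat completion request for the vLLM OpenAI-compatible server.
payload = {
    "model": "./Qwen3.5-27B-heretic-v3-autoround-w4a16",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}
req = request.Request(
    "http://localhost:1234/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With the server running, send it and print the reply:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
print(req.full_url)
```

Any OpenAI-compatible client (e.g. the `openai` Python package pointed at `http://localhost:1234/v1`) works the same way.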
### Python

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "groxaxo/Qwen3.5-27B-heretic-v3-autoround-w4a16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "groxaxo/Qwen3.5-27B-heretic-v3-autoround-w4a16"
)
```
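Loading alone produces no text; a minimal generation sketch continuing from the snippet above (the prompt is illustrative, and this assumes the model ships a chat template — untested here, since it requires downloading the weights):

```python
# Continues from the loading snippet above; `model` and `tokenizer`
# are assumed to be already loaded.
messages = [{"role": "user", "content": "Explain W4A16 quantization briefly."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:],
                       skip_special_tokens=True))
```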
## Hardware Requirements

- Minimum: 2x GPUs with 24 GB VRAM each (for 150k context)
- Recommended: 2x RTX 3090 / 4090 or equivalent
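Those requirements can be sanity-checked with quick arithmetic (a sketch: "24 GB" is treated as 24 GiB here, and the headroom figure is a remainder, not a measurement of actual KV-cache or activation usage):

```python
# Per-GPU budget with tensor parallelism across 2x 24 GiB cards
# (illustrative; real runtime overhead varies by engine and config).
weight_gib = 27e9 * 4 / 8 / 2**30     # ~12.6 GiB of INT4 weights total
per_gpu_weights = weight_gib / 2      # split across 2 GPUs
headroom = 24 - per_gpu_weights       # left for KV cache + activations
print(f"~{per_gpu_weights:.1f} GiB weights/GPU, ~{headroom:.1f} GiB headroom")
```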
## Credits
- Base model: Qwen Team
- Quantization: Intel AutoRound
- Fine-tuning: Heretic v3