Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ-4bit

AutoAWQ-format 4-bit quantized version of Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2.

This checkpoint keeps the same hybrid Qwen3.5 DeltaNet + softmax architecture and Qwen3.5 MTP head as the BF16 source, but exports an AutoAWQ-compatible W4A16 checkpoint for broader AWQ tooling and runtime compatibility.

The published folder includes:

  • model-00001-of-00011.safetensors ... model-00011-of-00011.safetensors
  • model.safetensors.index.json
  • model.mtp.safetensors
  • quantization_config.json
  • processor_config.json
  • preprocessor_config.json
  • video_preprocessor_config.json

Verified Inference

Local export was completed on 2026-03-31 on a single NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB) with:

  • auto-round==0.10.2
  • transformers==5.3.0
  • vllm==0.17.1

What was verified in that run:

  • the AutoAWQ export completed successfully
  • quantization_config.json was written with quant_method=awq
  • the output uses bits=4, group_size=128, sym=false, zero_point=true
  • model.mtp.safetensors was restored into the output folder

Full local vLLM serve validation for this exact AWQ v2 export is still pending.
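The checkpoint's quantization metadata can be sanity-checked after download. A minimal sketch: the expected dict below is reconstructed from the run notes above, and the real quantization_config.json may carry additional keys.

```python
import json

# Expected quantization_config.json fields for this export, reconstructed from
# the verification notes above; the file on disk may contain additional keys.
EXPECTED = {
    "quant_method": "awq",
    "bits": 4,
    "group_size": 128,
    "zero_point": True,
    "version": "gemm",
}

def matches_export(cfg: dict) -> bool:
    """True when a loaded quantization config matches this v2 AWQ export."""
    return all(cfg.get(k) == v for k, v in EXPECTED.items())

# Example: round-trip through JSON as a stand-in for reading the real file.
print(matches_export(json.loads(json.dumps(EXPECTED))))
```

In practice you would pass the parsed contents of quantization_config.json from the downloaded folder to matches_export.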

Quantization Strategy

The checkpoint was produced with AutoRound and exported in AutoAWQ format, using W4A16 asymmetric group-wise quantization:

Precision and layer coverage:

  • INT4 weights + BF16 activations: most quantized linear layers
  • BF16 (left unquantized): lm_head, embed_tokens, self_attn.o_proj, DeltaNet linear_attn.out_proj, DeltaNet in_proj_a/in_proj_b, visual encoder, MTP sidecar

AWQ details:

  • weights: INT4
  • activations: BF16/FP16 at inference time
  • group size: 128
  • asymmetric quantization: sym=false
  • zero point: true
  • format: AutoAWQ gemm
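The scheme above can be illustrated with a pure-Python sketch of asymmetric group-wise INT4 quantization: each group of 128 weights gets its own scale and integer zero point, and round-to-nearest keeps the reconstruction error within half a quantization step. This is an illustration of the general technique, not the AutoAWQ kernel itself.

```python
# Minimal sketch of asymmetric group-wise INT4 weight quantization, as in the
# W4A16 scheme above (bits=4, group_size=128, zero_point=true). Illustrative
# only; it does not reproduce AutoAWQ's activation-aware scale search.
def quantize(group, bits=4):
    qmax = (1 << bits) - 1                       # 15 integer levels for INT4
    lo, hi = min(group), max(group)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero = round(-lo / scale)                    # integer zero point (asymmetric)
    q = [max(0, min(qmax, round(w / scale) + zero)) for w in group]
    return q, scale, zero

def dequantize(q, scale, zero):
    return [(v - zero) * scale for v in q]

group = [(-1) ** i * (i / 127.0) for i in range(128)]   # one 128-weight group
q, scale, zero = quantize(group)
recon = dequantize(q, scale, zero)
err = max(abs(a - b) for a, b in zip(group, recon))
print(err <= scale / 2 + 1e-9)   # round-to-nearest: error within half a step
```

The per-group scale and zero point are what the runtime stores alongside the packed INT4 weights; dequantization back to BF16/FP16 happens on the fly during the matmul.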

Architecture match with the BF16 source:

  • model_type=qwen3_5
  • 64 text layers
  • full_attention_interval=4
  • mtp_num_hidden_layers=1
  • max_position_embeddings=262144
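If full_attention_interval=4 is read as one full softmax-attention layer per four text layers (an interpretation of the config field, not something the config itself states), the hybrid split works out as:

```python
# Hypothetical layer-split arithmetic for the hybrid stack; the meaning of
# full_attention_interval is an assumption, not confirmed by the config.
num_text_layers = 64
full_attention_interval = 4

full_attn_layers = num_text_layers // full_attention_interval   # softmax attention
deltanet_layers = num_text_layers - full_attn_layers            # DeltaNet layers
print(full_attn_layers, deltanet_layers)  # → 16 48
```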

Local Benchmark Slice

No local benchmark slice is included yet for this AWQ v2 export.

The export completed successfully and the checkpoint layout is ready for upload, but full serve/runtime validation is still pending.

Usage

vLLM

pip install -U vllm==0.17.1 transformers==5.3.0

Expected serving command:

vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ-4bit \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3

With MTP enabled:

vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ-4bit \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
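Once either serve command is up, vLLM exposes an OpenAI-compatible HTTP endpoint. A minimal stdlib client sketch; port 8000 is the vLLM default, and the prompt is purely illustrative:

```python
import json
import urllib.request

MODEL = "mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ-4bit"

def build_chat_request(prompt: str, port: int = 8000) -> urllib.request.Request:
    """Build a chat-completions request for the local vLLM server started above."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"http://localhost:{port}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Briefly explain AWQ quantization.")
print(req.full_url)
# With the server running, send it with: json.load(urllib.request.urlopen(req))
```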

Transformers

This export is not intended for plain transformers inference.

Use a runtime that understands AutoAWQ-format checkpoints, such as vLLM with AWQ support.

Compatibility

  • vLLM >= 0.17.0: Expected. Intended serving path for this AutoAWQ export; exact local serve validation still pending.
  • transformers >= 5.3.0: No. Plain transformers is not the intended inference path for this AutoAWQ checkpoint.
  • AutoAWQ-compatible runtimes: Expected. Export format is AutoAWQ-style quant_method=awq, version=gemm.
  • SGLang: Unknown. Not verified for this export.

Notes

  • This is an AutoAWQ-format export, not the compressed-tensors AWQ format.
  • The output keeps self_attn.o_proj and DeltaNet linear_attn.out_proj in BF16 rather than 4-bit.
  • The output folder includes the Qwen3.5 MTP sidecar and processor metadata needed for serving compatibility.