Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ-4bit

AutoAWQ-format 4-bit quantized version of Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2.

This checkpoint keeps the same hybrid Qwen3.5 DeltaNet + softmax architecture and Qwen3.5 MTP head as the BF16 source, but exports an AutoAWQ-compatible W4A16 checkpoint for broader AWQ tooling and runtime compatibility.

The published folder includes:

  • model-00001-of-00011.safetensors ... model-00011-of-00011.safetensors
  • model.safetensors.index.json
  • model.mtp.safetensors
  • quantization_config.json
  • processor_config.json
  • preprocessor_config.json
  • video_preprocessor_config.json

Verified Inference

Local export was completed on 2026-03-31 on a single NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB) with:

  • auto-round==0.10.2
  • transformers==5.3.0
  • vllm==0.17.1

What was verified in that run:

  • the AutoAWQ export completed successfully
  • quantization_config.json was written with quant_method=awq
  • the output uses bits=4, group_size=128, sym=false, zero_point=true
  • model.mtp.safetensors was restored into the output folder

Full local vLLM serve validation for this exact AWQ v2 export is still pending.
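The checkpoint's quantization metadata can be sanity-checked after download. A minimal sketch: the expected dict below is reconstructed from the run notes above, and the real quantization_config.json may carry additional keys.

```python
import json

# Expected quantization_config.json fields for this export, reconstructed from
# the verification notes above; the file on disk may contain additional keys.
EXPECTED = {
    "quant_method": "awq",
    "bits": 4,
    "group_size": 128,
    "zero_point": True,
    "version": "gemm",
}

def matches_export(cfg: dict) -> bool:
    """True when a loaded quantization config matches this v2 AWQ export."""
    return all(cfg.get(k) == v for k, v in EXPECTED.items())

# Example: round-trip through JSON as a stand-in for reading the real file.
print(matches_export(json.loads(json.dumps(EXPECTED))))
```

In practice you would pass the parsed contents of quantization_config.json from the downloaded folder to matches_export.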

Quantization Strategy

The checkpoint was produced with AutoRound and exported in AutoAWQ format, using W4A16 asymmetric group-wise quantization:

Precision and layer coverage:

  • INT4 weights + BF16 activations: most quantized linear layers
  • BF16 (left unquantized): lm_head, embed_tokens, self_attn.o_proj, DeltaNet linear_attn.out_proj, DeltaNet in_proj_a/in_proj_b, visual encoder, MTP sidecar

AWQ details:

  • weights: INT4
  • activations: BF16/FP16 at inference time
  • group size: 128
  • asymmetric quantization: sym=false
  • zero point: true
  • format: AutoAWQ gemm
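The scheme above can be illustrated with a pure-Python sketch of asymmetric group-wise INT4 quantization: each group of 128 weights gets its own scale and integer zero point, and round-to-nearest keeps the reconstruction error within half a quantization step. This is an illustration of the general technique, not the AutoAWQ kernel itself.

```python
# Minimal sketch of asymmetric group-wise INT4 weight quantization, as in the
# W4A16 scheme above (bits=4, group_size=128, zero_point=true). Illustrative
# only; it does not reproduce AutoAWQ's activation-aware scale search.
def quantize(group, bits=4):
    qmax = (1 << bits) - 1                       # 15 integer levels for INT4
    lo, hi = min(group), max(group)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero = round(-lo / scale)                    # integer zero point (asymmetric)
    q = [max(0, min(qmax, round(w / scale) + zero)) for w in group]
    return q, scale, zero

def dequantize(q, scale, zero):
    return [(v - zero) * scale for v in q]

group = [(-1) ** i * (i / 127.0) for i in range(128)]   # one 128-weight group
q, scale, zero = quantize(group)
recon = dequantize(q, scale, zero)
err = max(abs(a - b) for a, b in zip(group, recon))
print(err <= scale / 2 + 1e-9)   # round-to-nearest: error within half a step
```

The per-group scale and zero point are what the runtime stores alongside the packed INT4 weights; dequantization back to BF16/FP16 happens on the fly during the matmul.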

Architecture match with the BF16 source:

  • model_type=qwen3_5
  • 64 text layers
  • full_attention_interval=4
  • mtp_num_hidden_layers=1
  • max_position_embeddings=262144
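If full_attention_interval=4 is read as one full softmax-attention layer per four text layers (an interpretation of the config field, not something the config itself states), the hybrid split works out as:

```python
# Hypothetical layer-split arithmetic for the hybrid stack; the meaning of
# full_attention_interval is an assumption, not confirmed by the config.
num_text_layers = 64
full_attention_interval = 4

full_attn_layers = num_text_layers // full_attention_interval   # softmax attention
deltanet_layers = num_text_layers - full_attn_layers            # DeltaNet layers
print(full_attn_layers, deltanet_layers)  # → 16 48
```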

Local Benchmark Slice

No local benchmark slice is included yet for this AWQ v2 export.

The export completed successfully and the checkpoint layout is ready for upload, but full serve/runtime validation is still pending.

Usage

vLLM

pip install -U vllm==0.17.1 transformers==5.3.0

Expected serving command:

vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ-4bit \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3

With MTP enabled:

vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ-4bit \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
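Once either serve command is up, vLLM exposes an OpenAI-compatible HTTP endpoint. A minimal stdlib client sketch; port 8000 is the vLLM default, and the prompt is purely illustrative:

```python
import json
import urllib.request

MODEL = "mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ-4bit"

def build_chat_request(prompt: str, port: int = 8000) -> urllib.request.Request:
    """Build a chat-completions request for the local vLLM server started above."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"http://localhost:{port}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Briefly explain AWQ quantization.")
print(req.full_url)
# With the server running, send it with: json.load(urllib.request.urlopen(req))
```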

Transformers

This export is not intended for plain transformers inference.

Use a runtime that understands AutoAWQ-format checkpoints, such as vLLM with AWQ support.

Compatibility

  • vLLM >= 0.17.0: Expected. Intended serving path for this AutoAWQ export; exact local serve validation still pending.
  • transformers >= 5.3.0: No. Plain transformers is not the intended inference path for this AutoAWQ checkpoint.
  • AutoAWQ-compatible runtimes: Expected. Export format is AutoAWQ-style quant_method=awq, version=gemm.
  • SGLang: Unknown. Not verified for this export.

Notes

  • This is an AutoAWQ-format export, not the compressed-tensors AWQ format.
  • The output keeps self_attn.o_proj and DeltaNet linear_attn.out_proj in BF16 rather than 4-bit.
  • The output folder includes the Qwen3.5 MTP sidecar and processor metadata needed for serving compatibility.