# Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ-4bit
AutoAWQ-format 4-bit quantized version of Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2.
This checkpoint keeps the same hybrid Qwen3.5 DeltaNet + softmax attention architecture and Qwen3.5 MTP head as the BF16 source, exported as an AutoAWQ-compatible W4A16 checkpoint for broader AWQ tooling and runtime compatibility.
The published folder includes:

- `model-00001-of-00011.safetensors` … `model-00011-of-00011.safetensors`
- `model.safetensors.index.json`
- `model.mtp.safetensors`
- `quantization_config.json`
- `processor_config.json`
- `preprocessor_config.json`
- `video_preprocessor_config.json`
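A sharded checkpoint like this can be checked for completeness before serving. A minimal sketch, assuming the standard Hugging Face `model.safetensors.index.json` layout with a top-level `weight_map`:

```python
import json


def shard_files(index_path="model.safetensors.index.json"):
    """Return the sorted list of unique shard files referenced by the index."""
    with open(index_path) as f:
        index = json.load(f)
    # weight_map maps each tensor name to the shard file that stores it
    return sorted(set(index["weight_map"].values()))
```

Comparing the returned list against the files on disk catches a partially downloaded checkpoint early.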
## Verified Inference
Local export was completed on 2026-03-31 on a single NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB) with:

- `auto-round==0.10.2`
- `transformers==5.3.0`
- `vllm==0.17.1`
What was verified in that run:

- the AutoAWQ export completed successfully
- `quantization_config.json` was written with `quant_method=awq`
- the output uses `bits=4`, `group_size=128`, `sym=false`, `zero_point=true`
- `model.mtp.safetensors` was restored into the output folder
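The config fields above can be re-checked on a downloaded copy. A minimal sketch, assuming the AutoAWQ-style key names in `quantization_config.json` (the exact keys are an assumption based on the values reported in this card):

```python
import json

# Expected AWQ fields for this export, taken from the model card.
EXPECTED = {
    "quant_method": "awq",
    "bits": 4,
    "group_size": 128,
    "zero_point": True,
    "version": "gemm",
}


def check_awq_config(path):
    """Return a list of (key, found, expected) mismatches; empty means OK."""
    with open(path) as f:
        cfg = json.load(f)
    return [(k, cfg.get(k), v) for k, v in EXPECTED.items() if cfg.get(k) != v]
```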
Full local vLLM serve validation for this exact AWQ v2 export is still pending.
## Quantization Strategy
AutoRound export in AutoAWQ format, using W4A16 asymmetric group-wise quantization:
| Precision | Layers |
|---|---|
| INT4 weights + BF16 activations | most quantized linear layers |
| BF16 | lm_head, embed_tokens, self_attn.o_proj, DeltaNet linear_attn.out_proj, DeltaNet in_proj_a/in_proj_b, visual encoder, MTP sidecar |
AWQ details:

- weights: INT4
- activations: BF16/FP16 at inference time
- group size: `128`
- asymmetric quantization: `sym=false`
- zero point: `true`
- format: AutoAWQ `gemm`
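To make the parameters above concrete, here is a minimal sketch of asymmetric group-wise INT4 quantization: each group of 128 weights shares one scale and one integer zero point, and weights are stored as unsigned 4-bit values (0..15). This illustrates the numeric scheme only, not AWQ's activation-aware scale search.

```python
GROUP_SIZE = 128  # group_size=128 from the config above


def quantize_group(weights):
    """Asymmetric INT4 quantization of one group -> (q, scale, zero_point)."""
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / 15 or 1.0  # 4 bits -> 16 levels (0..15)
    zero_point = round(-w_min / scale)   # integer offset mapping w_min near 0
    q = [max(0, min(15, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize_group(q, scale, zero_point):
    """Reconstruct approximate FP values from the packed representation."""
    return [(v - zero_point) * scale for v in q]
```

Because the zero point is per group, a group whose weights are all positive (or all negative) still uses the full 16-level range, which is the advantage of `sym=false` over symmetric quantization.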
Architecture match with the BF16 source:

- `model_type=qwen3_5`
- 64 text layers
- `full_attention_interval=4`
- `mtp_num_hidden_layers=1`
- `max_position_embeddings=262144`
## Local Benchmark Slice
No local benchmark slice is included yet for this AWQ v2 export.
The export completed successfully and the checkpoint layout is ready for upload, but full serve/runtime validation is still pending.
## Usage

### vLLM
```shell
pip install -U vllm==0.17.1 transformers==5.3.0
```
Expected serving command:
```shell
vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ-4bit \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3
```
With MTP enabled:
```shell
vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ-4bit \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
```
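Once the server is up, it exposes the standard vLLM OpenAI-compatible API. A minimal client sketch, assuming the default `http://localhost:8000` endpoint (the URL and port are assumptions matching a default `vllm serve` run):

```python
import json
import urllib.request

MODEL = "mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ-4bit"


def build_chat_request(prompt, model=MODEL, max_tokens=512):
    """Build the JSON body for a /v1/chat/completions request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def send(body, url="http://localhost:8000/v1/chat/completions"):
    """POST the request body to the running vLLM server and decode the reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```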
### Transformers
This export is not intended for plain transformers inference.
Use a runtime that understands AutoAWQ-format checkpoints, such as vLLM with AWQ support.
## Compatibility
| Framework | Supported | Notes |
|---|---|---|
| vLLM >= 0.17.0 | Expected | Intended serving path for this AutoAWQ export; exact local serve validation still pending |
| transformers >= 5.3.0 | No | Plain transformers is not the intended inference path for this AutoAWQ checkpoint |
| AutoAWQ-compatible runtimes | Expected | Export format is AutoAWQ-style quant_method=awq, version=gemm |
| SGLang | Unknown | Not verified for this export |
## Notes
- This is an AutoAWQ-format export, not the compressed-tensors AWQ format.
- The output keeps `self_attn.o_proj` and DeltaNet `linear_attn.out_proj` in BF16 rather than 4-bit.
- The output folder includes the Qwen3.5 MTP sidecar and processor metadata needed for serving compatibility.
Base model: Qwen/Qwen3.5-27B