---
language:
- en
- zh
license: apache-2.0
base_model: Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2
tags:
- qwen3.5
- reasoning
- quantized
- awq
- autoawq
- 4bit
- int4
- deltanet
- chain-of-thought
- mtp
pipeline_tag: text-generation
library_name: transformers
model_name: Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ-4bit
quantized_by: mconcat
---

# Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ-4bit

AutoAWQ-format 4-bit quantized version of [Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2](https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2).

This checkpoint keeps the same hybrid Qwen3.5 DeltaNet + softmax architecture and Qwen3.5 MTP head as the BF16 source, but exports an AutoAWQ-compatible W4A16 checkpoint for broader AWQ tooling and runtime compatibility.

The published folder includes:

- `model-00001-of-00011.safetensors` ... `model-00011-of-00011.safetensors`
- `model.safetensors.index.json`
- `model.mtp.safetensors`
- `quantization_config.json`
- `processor_config.json`
- `preprocessor_config.json`
- `video_preprocessor_config.json`

## Verified Inference

Local export was completed on **2026-03-31** on a single **NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB)** with:

- `auto-round==0.10.2`
- `transformers==5.3.0`
- `vllm==0.17.1`

What was verified in that run:

- the AutoAWQ export completed successfully
- `quantization_config.json` was written with `quant_method=awq`
- the output uses `bits=4`, `group_size=128`, `sym=false`, `zero_point=true`
- `model.mtp.safetensors` was restored into the output folder

Full local vLLM serve validation for this exact AWQ v2 export is still pending.
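The verified export settings above can be sanity-checked by reading the shipped `quantization_config.json`. A minimal sketch, using an inline illustrative config that mirrors the reported values rather than the actual file contents:

```python
import json

# Illustrative config mirroring the values reported above; against a real
# download you would instead do: cfg = json.load(open("quantization_config.json"))
cfg = json.loads("""{
  "quant_method": "awq",
  "bits": 4,
  "group_size": 128,
  "sym": false,
  "zero_point": true,
  "version": "gemm"
}""")

# Confirm the fields that define this W4A16 asymmetric export
assert cfg["quant_method"] == "awq"
assert cfg["bits"] == 4 and cfg["group_size"] == 128
assert cfg["sym"] is False and cfg["zero_point"] is True
print("AWQ export config consistent, kernel format:", cfg["version"])
```

Reading the config this way is a cheap pre-flight check before spending time on a full model download or serve attempt.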
## Quantization Strategy

AutoRound AutoAWQ export using W4A16 asymmetric group-wise quantization:

| Precision | Layers |
|-----------|--------|
| **INT4 weights + BF16 activations** | most quantized linear layers |
| **BF16** | `lm_head`, `embed_tokens`, `self_attn.o_proj`, DeltaNet `linear_attn.out_proj`, DeltaNet `in_proj_a`/`in_proj_b`, visual encoder, MTP sidecar |

AWQ details:

- weights: INT4
- activations: BF16/FP16 at inference time
- group size: `128`
- asymmetric quantization: `sym=false`
- zero point: `true`
- format: AutoAWQ `gemm`

Architecture match with the BF16 source:

- `model_type=qwen3_5`
- `64` text layers
- `full_attention_interval=4`
- `mtp_num_hidden_layers=1`
- `max_position_embeddings=262144`

## Local Benchmark Slice

No local benchmark slice is included yet for this AWQ v2 export. The export completed successfully and the checkpoint layout is ready for upload, but full serve/runtime validation is still pending.

## Usage

### vLLM

```bash
pip install -U vllm==0.17.1 transformers==5.3.0
```

Expected serving command:

```bash
vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ-4bit \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3
```

With MTP enabled:

```bash
vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ-4bit \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
```

### Transformers

This export is not intended for plain `transformers` inference. Use a runtime that understands AutoAWQ-format checkpoints, such as vLLM with AWQ support.
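Once `vllm serve` is running, the server exposes an OpenAI-compatible `/v1/chat/completions` endpoint on port 8000 by default. A minimal stdlib sketch that builds a request payload for it (the prompt and sampling settings here are illustrative, not recommendations):

```python
import json
import urllib.request

MODEL_ID = "mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ-4bit"

def build_chat_request(prompt: str, max_tokens: int = 512) -> dict:
    # Payload shape for vLLM's OpenAI-compatible chat completions API
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.6,
    }

payload = build_chat_request("Explain AWQ group-wise quantization in two sentences.")
print(json.dumps(payload, indent=2))

# Against a live server started with the command above, uncomment:
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# resp = json.load(urllib.request.urlopen(req))
# print(resp["choices"][0]["message"]["content"])
```

With `--reasoning-parser qwen3`, the returned message should also carry the model's chain-of-thought separately from the final answer (vLLM's reasoning parsers emit it as a `reasoning_content` field on the message).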
## Compatibility

| Framework | Supported | Notes |
|-----------|-----------|-------|
| vLLM >= 0.17.0 | Expected | Intended serving path for this AutoAWQ export; exact local serve validation still pending |
| transformers >= 5.3.0 | No | Plain `transformers` is not the intended inference path for this AutoAWQ checkpoint |
| AutoAWQ-compatible runtimes | Expected | Export format is AutoAWQ-style `quant_method=awq`, `version=gemm` |
| SGLang | Unknown | Not verified for this export |

## Notes

- This is an AutoAWQ-format export, not the compressed-tensors AWQ format.
- The output keeps `self_attn.o_proj` and DeltaNet `linear_attn.out_proj` in BF16 rather than 4-bit.
- The output folder includes the Qwen3.5 MTP sidecar and processor metadata needed for serving compatibility.
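The mixed-precision claim in the notes above (AWQ-packed INT4 for most linears, BF16 for `self_attn.o_proj` and friends) can be confirmed without loading the model, because a safetensors shard begins with an 8-byte little-endian header length followed by a JSON map of tensor name to dtype, shape, and offsets. A minimal sketch over a synthetic two-tensor header (illustrative bytes, not an actual shard from this repo):

```python
import json
import struct

def read_safetensors_header(raw: bytes) -> dict:
    # safetensors layout: u64 little-endian header size, then the JSON header
    (size,) = struct.unpack("<Q", raw[:8])
    return json.loads(raw[8 : 8 + size])

# Synthetic shard header: one BF16-kept projection, one AWQ-packed weight.
# AutoAWQ gemm checkpoints store packed weights as I32 `qweight` tensors.
header = {
    "model.layers.0.self_attn.o_proj.weight": {
        "dtype": "BF16", "shape": [4, 4], "data_offsets": [0, 32]},
    "model.layers.0.mlp.gate_proj.qweight": {
        "dtype": "I32", "shape": [4, 1], "data_offsets": [32, 48]},
}
encoded = json.dumps(header).encode()
blob = struct.pack("<Q", len(encoded)) + encoded + b"\x00" * 48

for name, meta in read_safetensors_header(blob).items():
    print(name, "->", meta["dtype"])
```

Pointing `read_safetensors_header` at the first bytes of a real `model-0000X-of-00011.safetensors` shard would list which tensors stayed BF16 and which were packed, without reading the tensor data itself.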