---
language:
- en
- zh
license: apache-2.0
base_model: Jackrong/Qwopus3.5-27B-v3
tags:
- qwen3.5
- reasoning
- quantized
- fp8
- fp8-dynamic
- compressed-tensors
- deltanet
- chain-of-thought
- mtp
pipeline_tag: text-generation
library_name: transformers
model_name: Qwopus3.5-27B-v3-FP8-Dynamic
quantized_by: mconcat
---

# Qwopus3.5-27B-v3-FP8-Dynamic

FP8 Dynamic quantized version of [Jackrong/Qwopus3.5-27B-v3](https://huggingface.co/Jackrong/Qwopus3.5-27B-v3).

This checkpoint preserves the hybrid Qwen3.5 DeltaNet + softmax attention architecture and the MTP (Multi-Token Prediction) head from the BF16 source, quantizing most linear layers to FP8 W8A8 while keeping the most sensitive projections and sidecar components in BF16.

## Verified Inference

Local export and sanity-check evaluation were verified on **2026-04-07** on a single **NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB)** with:

- `transformers==5.3.0`
- `llm-compressor==0.14.1.dev24`
- `vllm==0.17.1`

What was verified:

- FP8 export completed successfully via llm-compressor
- MTP weights are included in the main safetensors file
- The checkpoint loads in vLLM and generates correct output
- Quick perplexity sanity check: **7.67** (FineWeb-Edu, 50 samples)

## Quantization Strategy

Uniform FP8_DYNAMIC quantization using [llm-compressor](https://github.com/vllm-project/llm-compressor):

| Precision | Layers |
|-----------|--------|
| **FP8 W8A8** | Most `Linear` layers (per-channel static weight scales, per-token dynamic input scales) |
| **BF16** | `lm_head`, `embed_tokens`, `self_attn.o_proj`, DeltaNet `linear_attn.out_proj`, DeltaNet `in_proj_a`/`in_proj_b`, visual encoder, MTP sidecar |

Architecture match with the BF16 source:

- `model_type=qwen3_5`
- `64` text layers (hybrid DeltaNet + softmax, `full_attention_interval=4`)
- `mtp_num_hidden_layers=1`
- `max_position_embeddings=262144`
- `hidden_size=5120`, `intermediate_size=17408`
- `vocab_size=248320`

## Usage

### vLLM

```bash
pip install -U "vllm>=0.17.0" "transformers>=5.3.0"
```

Standard serving:

```bash
vllm serve mconcat/Qwopus3.5-27B-v3-FP8-Dynamic \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3
```

With MTP speculative decoding:

```bash
vllm serve mconcat/Qwopus3.5-27B-v3-FP8-Dynamic \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
```

### Transformers

```python
import torch
from transformers import AutoTokenizer, Qwen3_5ForConditionalGeneration

model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "mconcat/Qwopus3.5-27B-v3-FP8-Dynamic",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "mconcat/Qwopus3.5-27B-v3-FP8-Dynamic",
    trust_remote_code=True,
)
```

## Compatibility

| Framework | Supported | Notes |
|-----------|-----------|-------|
| vLLM >= 0.17.0 | Yes | Verified with `vllm==0.17.1` on Blackwell; MTP works |
| transformers >= 5.3.0 | Yes | Direct loading with `device_map="auto"` |
| SGLang | Unknown | Not verified |

## Notes

- This export keeps `self_attn.o_proj` and DeltaNet `linear_attn.out_proj` in BF16 to preserve output projection fidelity.
- MTP weights are embedded in the main `model.safetensors` file (no separate `model.mtp.safetensors`).
- The model includes a vision encoder (loaded but unused for text-only inference). Use `--skip-mm-profiling` with vLLM to skip vision encoder profiling.
- **Blackwell (SM120) note:** If you encounter TMA-related crashes, apply the one-line vLLM patch to disable TMA on Blackwell: change `>= 9` to `9 <= x < 12` in `vllm/model_executor/layers/fla/ops/utils.py`.
- **KV cache:** Do not use `--kv-cache-dtype fp8_e4m3` with this model family. The checkpoint lacks calibrated KV scales and will produce degraded output; use the default BF16 KV cache.
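Since the serving commands above pass `--reasoning-parser qwen3`, vLLM separates the chain-of-thought from the final answer server-side. When loading the model directly with Transformers instead, you get the raw output and must split it yourself. Below is a minimal client-side sketch, assuming the model wraps its reasoning in Qwen-style `<think>...</think>` tags; the function name `split_reasoning` is illustrative, not part of any library.

```python
import re

# Assumption: the model emits Qwen-style <think>...</think> blocks
# before the final answer, as handled by vLLM's qwen3 reasoning parser.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text: str) -> tuple[str, str]:
    """Split raw model output into (reasoning, final_answer)."""
    match = THINK_RE.search(text)
    if match is None:
        # No reasoning block: treat the whole output as the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

raw = "<think>2 + 2 is 4.</think>The answer is 4."
reasoning, answer = split_reasoning(raw)
print(reasoning)  # -> 2 + 2 is 4.
print(answer)     # -> The answer is 4.
```

When serving through vLLM with the reasoning parser enabled, this step is unnecessary: the OpenAI-compatible API already returns the reasoning in a separate field.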