---
language:
- en
- zh
license: apache-2.0
base_model: Jackrong/Qwopus3.5-27B-v3
tags:
- qwen3.5
- reasoning
- quantized
- fp8
- fp8-dynamic
- compressed-tensors
- deltanet
- chain-of-thought
- mtp
pipeline_tag: text-generation
library_name: transformers
model_name: Qwopus3.5-27B-v3-FP8-Dynamic
quantized_by: mconcat
---

# Qwopus3.5-27B-v3-FP8-Dynamic

FP8 Dynamic quantized version of [Jackrong/Qwopus3.5-27B-v3](https://huggingface.co/Jackrong/Qwopus3.5-27B-v3).

This checkpoint preserves the hybrid Qwen3.5 DeltaNet + softmax attention architecture and the MTP (Multi-Token Prediction) head from the BF16 source, quantizing most linear layers to FP8 W8A8 while keeping the most sensitive projections and sidecar components in BF16.

## Verified Inference

Local export and sanity-check evaluation were verified on **2026-04-07** on a single **NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB)** with:

- `transformers==5.3.0`
- `llm-compressor==0.14.1.dev24`
- `vllm==0.17.1`

What was verified:

- FP8 export completed successfully via llm-compressor
- MTP weights are included in the main safetensors file
- The checkpoint loads in vLLM and generates correct output
- Quick perplexity sanity check: **7.67** (FineWeb-Edu, 50 samples)

## Quantization Strategy

Uniform FP8_DYNAMIC quantization using [llm-compressor](https://github.com/vllm-project/llm-compressor):

| Precision | Layers |
|-----------|--------|
| **FP8 W8A8** | Most `Linear` layers (per-channel static weight scales, per-token dynamic input scales) |
| **BF16** | `lm_head`, `embed_tokens`, `self_attn.o_proj`, DeltaNet `linear_attn.out_proj`, DeltaNet `in_proj_a`/`in_proj_b`, visual encoder, MTP sidecar |

Architecture match with the BF16 source:

- `model_type=qwen3_5`
- `64` text layers (hybrid DeltaNet + softmax, `full_attention_interval=4`)
- `mtp_num_hidden_layers=1`
- `max_position_embeddings=262144`
- `hidden_size=5120`, `intermediate_size=17408`
- `vocab_size=248320`

## Usage

### vLLM

```bash
pip install -U "vllm>=0.17.0" "transformers>=5.3.0"
```

Standard serving:

```bash
vllm serve mconcat/Qwopus3.5-27B-v3-FP8-Dynamic \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3
```

With MTP speculative decoding:

```bash
vllm serve mconcat/Qwopus3.5-27B-v3-FP8-Dynamic \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
```

### Transformers

```python
import torch
from transformers import AutoTokenizer, Qwen3_5ForConditionalGeneration

model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "mconcat/Qwopus3.5-27B-v3-FP8-Dynamic",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "mconcat/Qwopus3.5-27B-v3-FP8-Dynamic",
    trust_remote_code=True,
)
```

## Compatibility

| Framework | Supported | Notes |
|-----------|-----------|-------|
| vLLM >= 0.17.0 | Yes | Verified with `vllm==0.17.1` on Blackwell; MTP works |
| transformers >= 5.3.0 | Yes | Direct loading with `device_map="auto"` |
| SGLang | Unknown | Not verified |

## Notes

- This export keeps `self_attn.o_proj` and DeltaNet `linear_attn.out_proj` in BF16 to preserve output projection fidelity.
- MTP weights are embedded in the main `model.safetensors` file (no separate `model.mtp.safetensors`).
- The model includes a vision encoder (loaded but unused for text-only inference). Use `--skip-mm-profiling` with vLLM to skip vision encoder profiling.
- **Blackwell (SM120) note:** If you encounter TMA-related crashes, apply the one-line vLLM patch to disable TMA on Blackwell: change `>= 9` to `9 <= x < 12` in `vllm/model_executor/layers/fla/ops/utils.py`.
- **KV cache:** Do not use `--kv-cache-dtype fp8_e4m3` with this model family. The checkpoint lacks calibrated KV scales and will produce degraded output; use the default BF16 KV cache.
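Since the serving commands above pass `--reasoning-parser qwen3`, vLLM separates the chain-of-thought from the final answer server-side. When loading the model directly with Transformers instead, you get the raw output and must split it yourself. Below is a minimal client-side sketch, assuming the model wraps its reasoning in Qwen-style `<think>...</think>` tags; the function name `split_reasoning` is illustrative, not part of any library.

```python
import re

# Assumption: the model emits Qwen-style <think>...</think> blocks
# before the final answer, as handled by vLLM's qwen3 reasoning parser.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text: str) -> tuple[str, str]:
    """Split raw model output into (reasoning, final_answer)."""
    match = THINK_RE.search(text)
    if match is None:
        # No reasoning block: treat the whole output as the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

raw = "<think>2 + 2 is 4.</think>The answer is 4."
reasoning, answer = split_reasoning(raw)
print(reasoning)  # -> 2 + 2 is 4.
print(answer)     # -> The answer is 4.
```

When serving through vLLM with the reasoning parser enabled, this step is unnecessary: the OpenAI-compatible API already returns the reasoning in a separate field.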