FP8 Dynamic quantized version of Jackrong/Qwopus3.5-27B-v3.
This checkpoint preserves the hybrid Qwen3.5 DeltaNet + softmax architecture and MTP (Multi-Token Prediction) head from the BF16 source, quantizing most linear layers to FP8 W8A8 while keeping the most sensitive projections and sidecar components in BF16.
Local export and a sanity-check evaluation were performed on 2026-04-07 on a single NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB) with:

- transformers==5.3.0
- llm-compressor==0.14.1.dev24
- vllm==0.17.1

What was verified:
Uniform FP8_DYNAMIC quantization using llm-compressor:
| Precision | Layers |
|---|---|
| FP8 W8A8 | most Linear layers (per-channel static weight scales, per-token dynamic input scales) |
| BF16 | lm_head, embed_tokens, self_attn.o_proj, DeltaNet linear_attn.out_proj, DeltaNet in_proj_a/in_proj_b, visual encoder, MTP sidecar |
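As a rough plain-Python illustration of the scheme above (a sketch, not the llm-compressor implementation): weight scales are computed once per output channel at export time, while activation scales are recomputed per token at runtime. 448.0 is the largest finite value representable in FP8 E4M3.

```python
FP8_E4M3_MAX = 448.0  # largest finite float8 E4M3 value

def weight_scales(weight_rows):
    """Static per-output-channel scales: one max-abs scale per weight row."""
    return [max(abs(w) for w in row) / FP8_E4M3_MAX for row in weight_rows]

def activation_scale(token_row):
    """Dynamic per-token scale, recomputed from each activation row at runtime."""
    return max(abs(v) for v in token_row) / FP8_E4M3_MAX

# A row whose max-abs value is exactly 448 maps to scale 1.0, i.e. dividing
# by the scale fits the row into the representable FP8 range.
print(weight_scales([[448.0, -224.0], [112.0, 56.0]]))  # [1.0, 0.25]
```

In the real checkpoint the scaled values are then cast to FP8; this sketch only shows how the two kinds of scales are derived.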
Architecture match with the BF16 source:
- model_type=qwen3_5
- 64 text layers (hybrid DeltaNet + softmax, full_attention_interval=4)
- mtp_num_hidden_layers=1
- max_position_embeddings=262144
- hidden_size=5120, intermediate_size=17408
- vocab_size=248320

Install:

```shell
pip install -U "vllm>=0.17.0" "transformers>=5.3.0"
```
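One plausible reading of full_attention_interval=4 over the 64 text layers (an assumption; the exact indexing convention is not documented here) is that every fourth layer runs full softmax attention and the rest run DeltaNet linear attention:

```python
NUM_LAYERS = 64
FULL_ATTENTION_INTERVAL = 4

def layer_kinds(num_layers=NUM_LAYERS, interval=FULL_ATTENTION_INTERVAL):
    # Assumed convention: layers 3, 7, 11, ... (every `interval`-th, 0-indexed)
    # use full softmax attention; all other layers use DeltaNet linear attention.
    return ["softmax" if (i + 1) % interval == 0 else "deltanet"
            for i in range(num_layers)]

kinds = layer_kinds()
print(kinds.count("softmax"), kinds.count("deltanet"))  # 16 48
```

Under this assumption the model has 16 softmax-attention layers and 48 DeltaNet layers.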
Standard serving:
```shell
vllm serve mconcat/Qwopus3.5-27B-v3-FP8-Dynamic \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3
```
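Once the server is up, it exposes vLLM's OpenAI-compatible API (by default at http://localhost:8000/v1). A minimal chat-completions payload might look like this; the prompt and sampling parameters are illustrative only:

```python
import json

# Request body for POST http://localhost:8000/v1/chat/completions
payload = {
    "model": "mconcat/Qwopus3.5-27B-v3-FP8-Dynamic",
    "messages": [
        {"role": "user", "content": "Summarize FP8 W8A8 quantization in one sentence."}
    ],
    "max_tokens": 256,
    "temperature": 0.6,
}
body = json.dumps(payload)
print(body[:50])
```

POST this body with Content-Type: application/json, e.g. via curl or the openai Python client pointed at the local base URL.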
With MTP speculative decoding:
```shell
vllm serve mconcat/Qwopus3.5-27B-v3-FP8-Dynamic \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
```
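Why num_speculative_tokens matters: under the standard speculative-decoding model, if each drafted token is accepted independently with probability alpha, one target forward pass yields on average (1 - alpha^(k+1)) / (1 - alpha) tokens for k draft tokens. A back-of-the-envelope sketch (alpha here is a made-up figure, not a measurement of this checkpoint):

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens produced per target forward pass with k draft tokens,
    assuming i.i.d. per-token acceptance probability alpha."""
    if alpha == 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# With k=1 (as in the serving command above) this reduces to 1 + alpha:
print(expected_tokens_per_step(0.7, 1))  # ~1.7 tokens per step
```

Higher k raises the ceiling but also wastes more draft work when acceptance is low; this checkpoint's MTP sidecar has one hidden layer, which matches k=1.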
```python
from transformers import AutoTokenizer, Qwen3_5ForConditionalGeneration
import torch

model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "mconcat/Qwopus3.5-27B-v3-FP8-Dynamic",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "mconcat/Qwopus3.5-27B-v3-FP8-Dynamic",
    trust_remote_code=True,
)
```
Framework compatibility:
| Framework | Supported | Notes |
|---|---|---|
| vLLM >= 0.17.0 | Yes | Verified with vllm==0.17.1 on Blackwell; MTP works |
| transformers >= 5.3.0 | Yes | Direct loading with device_map="auto" |
| SGLang | Unknown | Not verified |
Notes:

- self_attn.o_proj and the DeltaNet linear_attn.out_proj are kept in BF16 to preserve output-projection fidelity.
- The MTP head weights are stored in the single model.safetensors file (no separate model.mtp.safetensors).
- Pass --skip-mm-profiling to vLLM to skip vision-encoder memory profiling.
- The compute-capability check in vllm/model_executor/layers/fla/ops/utils.py was changed from >= 9 to 9 <= x < 12.
- Do not use --kv-cache-dtype fp8_e4m3 with this model family: the checkpoint lacks calibrated KV scales and will produce degraded output. Use the default BF16 KV cache.