language:
  - en
  - zh
license: apache-2.0
base_model: Jackrong/Qwopus3.5-27B-v3
tags:
  - qwen3.5
  - reasoning
  - quantized
  - fp8
  - fp8-dynamic
  - compressed-tensors
  - deltanet
  - chain-of-thought
  - mtp
pipeline_tag: text-generation
library_name: transformers
model_name: Qwopus3.5-27B-v3-FP8-Dynamic
quantized_by: mconcat

Qwopus3.5-27B-v3-FP8-Dynamic

FP8 Dynamic quantized version of Jackrong/Qwopus3.5-27B-v3.

This checkpoint preserves the hybrid Qwen3.5 DeltaNet + softmax architecture and MTP (Multi-Token Prediction) head from the BF16 source, quantizing most linear layers to FP8 W8A8 while keeping the most sensitive projections and sidecar components in BF16.
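In FP8 W8A8 with dynamic activation quantization, each weight output channel carries one static scale computed offline, while each activation token gets a fresh scale at inference time. A simplified numpy sketch of the scaling scheme (clipping to the e4m3 range of ±448; actual FP8 mantissa rounding is not modeled here):

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 e4m3

def quantize_weight_per_channel(w):
    """Static per-output-channel weight scales: one scale per row, fixed offline."""
    scales = np.abs(w).max(axis=1, keepdims=True) / E4M3_MAX
    q = np.clip(w / scales, -E4M3_MAX, E4M3_MAX)  # values now fit the FP8 range
    return q, scales

def quantize_activation_per_token(x):
    """Dynamic per-token activation scales, recomputed for every input at runtime."""
    scales = np.abs(x).max(axis=1, keepdims=True) / E4M3_MAX
    q = np.clip(x / scales, -E4M3_MAX, E4M3_MAX)
    return q, scales

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16)).astype(np.float32)  # (out_features, in_features)
x = rng.normal(size=(4, 16)).astype(np.float32)  # (tokens, in_features)

qw, sw = quantize_weight_per_channel(w)
qx, sx = quantize_activation_per_token(x)

# Dequantized matmul: (qx * sx) @ (qw * sw).T approximates x @ w.T
y_ref = x @ w.T
y_q = (qx * sx) @ (qw * sw).T
print(np.max(np.abs(y_ref - y_q)))  # tiny: only scaling round-trip, no FP8 rounding
```

The real kernels keep weights and activations in FP8 and fold the scales into the GEMM epilogue; the sketch only shows where the two kinds of scales come from.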

Verified Inference

Local export and sanity-check evaluation were verified on 2026-04-07 on a single NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB) with:

  • transformers==5.3.0
  • llm-compressor==0.14.1.dev24
  • vllm==0.17.1

What was verified:

  • FP8 export completed successfully via llm-compressor
  • MTP weights are included in the main safetensors file
  • The checkpoint loads in vLLM and generates correct output
  • Quick perplexity sanity check: 7.67 (FineWeb-Edu, 50 samples)
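Perplexity, as used in the sanity check above, is the exponential of the mean per-token negative log-likelihood. A toy illustration of the formula with made-up log-probabilities (the real check averaged over 50 FineWeb-Edu samples):

```python
import math

# Hypothetical per-token log-probabilities (natural log) from a model
token_logprobs = [-1.9, -2.3, -1.5, -2.1, -2.4]

nll = -sum(token_logprobs) / len(token_logprobs)  # mean negative log-likelihood
ppl = math.exp(nll)                                # perplexity
print(round(ppl, 2))  # → 7.69
```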

Quantization Strategy

Uniform FP8_DYNAMIC quantization using llm-compressor:

Precision   Layers
FP8 W8A8    Most Linear layers (per-channel static weight scales, per-token dynamic input scales)
BF16        lm_head, embed_tokens, self_attn.o_proj, DeltaNet linear_attn.out_proj, DeltaNet in_proj_a/in_proj_b, visual encoder, MTP sidecar
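The BF16 carve-outs are typically expressed as an ignore list of module-name patterns in the llm-compressor recipe. A hypothetical matcher showing which module names would be skipped — the patterns mirror the table above, but the exact module paths depend on the transformers implementation:

```python
import re

# Illustrative patterns for modules kept in BF16 (excluded from FP8 quantization)
BF16_PATTERNS = [
    r"lm_head",
    r"embed_tokens",
    r"self_attn\.o_proj",
    r"linear_attn\.out_proj",
    r"linear_attn\.in_proj_[ab]",
    r"^visual\.",   # vision encoder
    r"^mtp\.",      # MTP sidecar
]

def kept_in_bf16(module_name: str) -> bool:
    """Return True if a module matches any BF16 carve-out pattern."""
    return any(re.search(p, module_name) for p in BF16_PATTERNS)

print(kept_in_bf16("model.layers.0.self_attn.o_proj"))      # True  (stays BF16)
print(kept_in_bf16("model.layers.0.mlp.gate_proj"))         # False (quantized to FP8)
print(kept_in_bf16("model.layers.1.linear_attn.in_proj_a")) # True  (stays BF16)
```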

Architecture match with the BF16 source:

  • model_type=qwen3_5
  • 64 text layers (hybrid DeltaNet + softmax, full_attention_interval=4)
  • mtp_num_hidden_layers=1
  • max_position_embeddings=262144
  • hidden_size=5120, intermediate_size=17408
  • vocab_size=248320
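With 64 layers and full_attention_interval=4, the hybrid stack interleaves one full softmax-attention layer per group of four, the rest being DeltaNet linear-attention layers. A sketch under the common convention that every fourth layer (0-indexed: 3, 7, 11, ...) is full attention; the exact offset may differ in the reference implementation:

```python
NUM_LAYERS = 64
FULL_ATTENTION_INTERVAL = 4

layer_types = [
    "full_attention" if (i + 1) % FULL_ATTENTION_INTERVAL == 0 else "linear_attention"
    for i in range(NUM_LAYERS)
]

print(layer_types[:4])                      # each group of four ends in full attention
print(layer_types.count("full_attention"))  # 16 of 64 layers use softmax attention
```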

Usage

vLLM

pip install -U "vllm>=0.17.0" "transformers>=5.3.0"

Standard serving:

vllm serve mconcat/Qwopus3.5-27B-v3-FP8-Dynamic \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3

With MTP speculative decoding:

vllm serve mconcat/Qwopus3.5-27B-v3-FP8-Dynamic \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
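Either serve command exposes the OpenAI-compatible /v1/chat/completions endpoint. A minimal client-side payload sketch — the localhost URL and sampling values are assumptions, not part of this card; send the body with any HTTP client:

```python
import json

payload = {
    "model": "mconcat/Qwopus3.5-27B-v3-FP8-Dynamic",
    "messages": [
        {"role": "user", "content": "Explain FP8 dynamic quantization in one paragraph."}
    ],
    "max_tokens": 512,
    "temperature": 0.6,
}

# POST this body to http://localhost:8000/v1/chat/completions
body = json.dumps(payload)
print(body)
```

With MTP speculative decoding enabled server-side, the client request is unchanged; drafting is transparent to the API.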

Transformers

from transformers import AutoTokenizer, Qwen3_5ForConditionalGeneration
import torch

model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "mconcat/Qwopus3.5-27B-v3-FP8-Dynamic",
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(
    "mconcat/Qwopus3.5-27B-v3-FP8-Dynamic",
    trust_remote_code=True,
)

Compatibility

Framework               Supported   Notes
vLLM >= 0.17.0          Yes         Verified with vllm==0.17.1 on Blackwell; MTP works
transformers >= 5.3.0   Yes         Direct loading with device_map="auto"
SGLang                  Unknown     Not verified

Notes

  • This export keeps self_attn.o_proj and DeltaNet linear_attn.out_proj in BF16 to preserve output projection fidelity.
  • MTP weights are embedded in the main model.safetensors file (no separate model.mtp.safetensors).
  • The model includes a vision encoder (loaded but unused for text-only inference). Use --skip-mm-profiling with vLLM to skip vision encoder profiling.
  • Blackwell (SM120) note: If you encounter TMA-related crashes, apply the one-line vLLM patch to disable TMA on Blackwell: change >= 9 to 9 <= x < 12 in vllm/model_executor/layers/fla/ops/utils.py.
  • KV cache: Do not use --kv-cache-dtype fp8_e4m3 with this model family — the checkpoint lacks calibrated KV scales and will produce degraded output. Use the default BF16 KV cache.