Qwopus v3 quants
Mixed-precision (NVFP4/FP8/BF16) quantized version of Jackrong/Qwopus3.5-27B-v3.
This checkpoint preserves the hybrid Qwen3.5 DeltaNet + softmax architecture and MTP (Multi-Token Prediction) head from the BF16 source, applying a layerwise mixed-precision recipe that balances compression with output quality.
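As background on the NVFP4 part of the recipe: NVFP4 stores weights as 4-bit E2M1 values (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) with a small per-block scale. The toy quantizer below illustrates the idea only; the real format also quantizes the block scales to FP8 and is implemented in fused kernels, not Python:

```python
# Toy NVFP4-style quantizer: 4-bit E2M1 values plus one scale per block.
# Illustrative only; production NVFP4 also stores the block scale in FP8.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GRID = sorted({s * v for s in (1, -1) for v in E2M1})  # signed E2M1 grid

def quantize_block(block):
    """Pick a scale so max|x| maps to 6.0 (largest E2M1 magnitude),
    then round each element to the nearest representable value."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 6.0
    q = [min(GRID, key=lambda g: abs(x / scale - g)) for x in block]
    return scale, q

def dequantize(scale, q):
    return [scale * v for v in q]

block = [0.1, -0.2, 0.35, 1.2, -3.0, 0.0, 0.7, 2.5]
scale, q = quantize_block(block)
recon = dequantize(scale, q)
```

The per-block scale is what lets a 4-bit grid with only 15 signed values track tensors whose dynamic range varies block to block.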
Local inference was verified on 2026-04-07 on a single NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB) with:
- `vllm==0.17.1`
- `transformers==5.3.0`
- `llm-compressor==0.14.1.dev24`

What was verified: serving with vLLM (including MTP speculative decoding) and direct loading with transformers.
On Blackwell GPUs (RTX 5090, RTX PRO 6000), two patches may be required:
1. Relax the compute-capability check from `>= 9` to `9 <= x < 12` in `vllm/model_executor/layers/fla/ops/utils.py`:

```shell
UTILS_FILE=$(python -c "import vllm, os; print(os.path.join(os.path.dirname(vllm.__file__), 'model_executor/layers/fla/ops/utils.py'))") && \
sed -i 's/is_nvidia and torch.cuda.get_device_capability(0)\[0\] >= 9/is_nvidia and 9 <= torch.cuda.get_device_capability(0)[0] < 12/' "$UTILS_FILE"
```

2. Force the CUTLASS NVFP4 GEMM backend:

```shell
export VLLM_NVFP4_GEMM_BACKEND=cutlass
```
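The `sed` edit only changes a capability predicate: Blackwell reports compute capability 12.x, which the original `>= 9` gate routed onto Hopper-era kernels. A standalone sketch of the before/after logic (function names are illustrative, not vLLM's):

```python
# Capability gate before and after the patch (illustrative, not vLLM's code).
def original_gate(major: int) -> bool:
    # Accepts Hopper (sm_90) and everything newer, including Blackwell (sm_12x).
    return major >= 9

def patched_gate(major: int) -> bool:
    # Restricts the fast path to Hopper-class; Blackwell takes a fallback path.
    return 9 <= major < 12

for major in (8, 9, 10, 12):
    print(major, original_gate(major), patched_gate(major))
```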
Non-uniform mixed-precision quantization using llm-compressor:
| Precision | Layers |
|---|---|
| FP8 W8A8 | DeltaNet in_proj_qkv, in_proj_z, out_proj; softmax q_proj/k_proj/v_proj; MLP down_proj |
| NVFP4 W4A4 | softmax o_proj; MLP gate_proj/up_proj |
| BF16 | lm_head, embed_tokens, DeltaNet in_proj_a/in_proj_b, norms, visual encoder, MTP sidecar |
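The table can be expressed as an ordered name-pattern mapping, in the spirit of how llm-compressor recipes target modules by pattern. The patterns below are a reconstruction from the table, not the actual recipe file:

```python
import re

# Reconstruction of the precision table as ordered (pattern -> scheme) rules.
# First match wins; anything unmatched stays in BF16.
RULES = [
    (r"(in_proj_qkv|in_proj_z|out_proj|q_proj|k_proj|v_proj|down_proj)", "FP8_W8A8"),
    (r"(o_proj|gate_proj|up_proj)", "NVFP4_W4A4"),
]

def scheme_for(layer_name: str) -> str:
    for pattern, scheme in RULES:
        if re.search(pattern, layer_name):
            return scheme
    return "BF16"

for name in ("model.layers.0.linear_attn.out_proj",
             "model.layers.3.self_attn.o_proj",
             "lm_head"):
    print(name, "->", scheme_for(name))
```

Rule order matters: the FP8 rule runs first so DeltaNet's `out_proj` is not caught by the NVFP4 `o_proj` pattern meant for softmax attention.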
Architecture match with the BF16 source:
- `model_type=qwen3_5`
- 64 text layers (hybrid DeltaNet + softmax, `full_attention_interval=4`)
- `mtp_num_hidden_layers=1`
- `max_position_embeddings=262144`
- `hidden_size=5120`, `intermediate_size=17408`
- `vocab_size=248320`

Installation:

```shell
pip install -U "vllm>=0.17.0" "transformers>=5.3.0"
```
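With `full_attention_interval=4` over 64 text layers, the hybrid schedule implies 16 softmax-attention layers and 48 DeltaNet layers. A sketch, assuming the common convention that the last layer of each group of 4 uses full attention (the config's own layer-type list is authoritative):

```python
NUM_LAYERS = 64
FULL_ATTENTION_INTERVAL = 4

# Assumed convention: layers 3, 7, 11, ... (the last of each group of 4)
# use softmax (full) attention; the other three use DeltaNet.
layer_types = [
    "full_attention" if (i + 1) % FULL_ATTENTION_INTERVAL == 0 else "linear_attention"
    for i in range(NUM_LAYERS)
]
n_full = layer_types.count("full_attention")
print(n_full, NUM_LAYERS - n_full)  # 16 full-attention, 48 DeltaNet layers
```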
Standard serving:
```shell
vllm serve mconcat/Qwopus3.5-27B-v3-NVFP4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3
```
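Once the server is up, it speaks the standard OpenAI-compatible API. A minimal client sketch; the URL is vLLM's default endpoint, and the payload is the generic chat-completions schema, nothing model-specific:

```python
import json
import urllib.request

def build_chat_request(prompt: str,
                       model: str = "mconcat/Qwopus3.5-27B-v3-NVFP4") -> dict:
    # Generic OpenAI-style chat-completions payload.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0.7,
    }

def send(payload: dict,
         url: str = "http://localhost:8000/v1/chat/completions") -> dict:
    # POST the JSON payload to the running vLLM server.
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("Explain DeltaNet attention in one paragraph.")
```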
With MTP speculative decoding:
```shell
vllm serve mconcat/Qwopus3.5-27B-v3-NVFP4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
```
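MTP speculative decoding drafts tokens with the built-in MTP head and lets the main model verify them; with `num_speculative_tokens=1`, each step proposes one extra token, which is kept only when verification accepts it. A toy greedy-verification sketch (not vLLM internals):

```python
def greedy_speculative_step(draft_next, target_next, context):
    """One speculative step with a single draft token, greedy verification:
    the draft token is kept only if the target model produces the same token,
    in which case the target's *next* token comes from the same forward pass."""
    d = draft_next(context)    # cheap draft token (the MTP head's role)
    t = target_next(context)   # target model's token at the same position
    if d == t:
        # Accepted: two tokens emitted for one target step.
        return [t, target_next(context + [t])]
    return [t]                 # rejected: keep only the target's token

# Toy models over token ids: target emits last token + 1; draft agrees
# only when the last token is even.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if ctx[-1] % 2 == 0 else ctx[-1]

print(greedy_speculative_step(draft, target, [4]))  # accepted -> [5, 6]
print(greedy_speculative_step(draft, target, [3]))  # rejected -> [4]
```

Because rejected drafts fall back to the target's own token, output quality is unchanged; acceptance rate only affects speed.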
On Blackwell GPUs, add the CUTLASS backend:
```shell
VLLM_NVFP4_GEMM_BACKEND=cutlass vllm serve mconcat/Qwopus3.5-27B-v3-NVFP4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
```
```python
from transformers import AutoTokenizer, Qwen3_5ForConditionalGeneration
import torch

model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "mconcat/Qwopus3.5-27B-v3-NVFP4",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "mconcat/Qwopus3.5-27B-v3-NVFP4",
    trust_remote_code=True,
)
```
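For chat use, prompts should go through `tokenizer.apply_chat_template`. As an illustration only, the function below approximates the ChatML layout that Qwen-family templates render; the tokenizer's bundled template is authoritative:

```python
def chatml_format(messages, add_generation_prompt=True):
    """Approximate Qwen-style ChatML rendering (illustrative).
    In real code, call tokenizer.apply_chat_template(messages, ...)."""
    out = []
    for m in messages:
        out.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        out.append("<|im_start|>assistant\n")
    return "".join(out)

prompt = chatml_format([{"role": "user", "content": "Hello"}])
```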
| Framework | Supported | Notes |
|---|---|---|
| vLLM >= 0.17.0 | Yes | Verified with vllm==0.17.1; MTP works; Blackwell requires CUTLASS backend |
| transformers >= 5.3.0 | Yes | Direct loading with device_map="auto" |
| SGLang | No | compressed-tensors NVFP4 not supported |
Notes:
- The MTP weights are merged into the single `model.safetensors` file (no separate `model.mtp.safetensors`).
- Use `--skip-mm-profiling` with vLLM to skip vision-encoder profiling.
- Do not use `--kv-cache-dtype fp8_e4m3` with this model family: the checkpoint lacks calibrated KV scales and will produce degraded output. Use the default BF16 KV cache.