Qwopus3.5-27B-v3-NVFP4

Mixed-precision (NVFP4/FP8/BF16) quantized version of Jackrong/Qwopus3.5-27B-v3.

This checkpoint preserves the hybrid Qwen3.5 DeltaNet + softmax attention architecture and the MTP (Multi-Token Prediction) head of the BF16 source, while applying a layerwise mixed-precision recipe that balances compression against output quality.

Verified Inference

Local inference was verified on 2026-04-07 on a single NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB) with:

  • vllm==0.17.1
  • transformers==5.3.0
  • llm-compressor==0.14.1.dev24

What was verified:

  • The mixed-precision export completed successfully via llm-compressor
  • MTP weights are included in the main safetensors file
  • The checkpoint loads and generates correct output in vLLM

Blackwell GPU Notes

On Blackwell GPUs (RTX 5090, RTX PRO 6000), two patches may be required:

  1. TMA patch: Change >= 9 to 9 <= x < 12 in vllm/model_executor/layers/fla/ops/utils.py
UTILS_FILE=$(python -c "import vllm, os; print(os.path.join(os.path.dirname(vllm.__file__), 'model_executor/layers/fla/ops/utils.py'))") && \
sed -i 's/is_nvidia and torch.cuda.get_device_capability(0)\[0\] >= 9/is_nvidia and 9 <= torch.cuda.get_device_capability(0)[0] < 12/' "$UTILS_FILE"
  2. NVFP4 GEMM backend: FlashInfer's FP4 GEMM kernel has a known bug on SM120 (consumer Blackwell). Use the CUTLASS backend instead:
export VLLM_NVFP4_GEMM_BACKEND=cutlass
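Both patches key off the GPU's compute capability major version: the patched TMA check only admits capability 9.x–11.x, and the CUTLASS fallback targets SM120 (capability 12, consumer Blackwell). A minimal sketch of that logic (the helper names are mine, not part of vLLM):

```python
def fla_tma_supported(major: int) -> bool:
    """Mirror of the patched vLLM check: TMA path only for capability 9.x-11.x."""
    return 9 <= major < 12

def needs_cutlass_backend(major: int) -> bool:
    """SM120 (consumer Blackwell, capability 12) hits the FlashInfer FP4 GEMM bug."""
    return major == 12

# On a real machine the major version comes from:
#   torch.cuda.get_device_capability(0)[0]
# RTX 5090 / RTX PRO 6000 report major == 12, so both workarounds apply there.
for major, tma, cutlass in [(9, True, False), (10, True, False), (12, False, True)]:
    assert fla_tma_supported(major) == tma
    assert needs_cutlass_backend(major) == cutlass
```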

Quantization Strategy

Non-uniform mixed-precision quantization using llm-compressor:

| Precision | Layers |
| --- | --- |
| FP8 W8A8 | DeltaNet in_proj_qkv, in_proj_z, out_proj; softmax q_proj/k_proj/v_proj; MLP down_proj |
| NVFP4 W4A4 | softmax o_proj; MLP gate_proj/up_proj |
| BF16 | lm_head, embed_tokens, DeltaNet in_proj_a/in_proj_b, norms, visual encoder, MTP sidecar |
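The recipe is effectively a lookup from projection name to precision. An illustrative sketch of that mapping (the function and its suffix-matching scheme are mine, not an llm-compressor API; module suffixes follow Qwen-style naming as listed in the table):

```python
# Suffixes taken from the precision table in this card.
FP8_SUFFIXES = ("in_proj_qkv", "in_proj_z", "out_proj",   # DeltaNet
                "q_proj", "k_proj", "v_proj",             # softmax attention
                "down_proj")                              # MLP
NVFP4_SUFFIXES = ("o_proj", "gate_proj", "up_proj")

def precision_for(module_name: str) -> str:
    """Return the storage precision this recipe assigns to a linear layer."""
    suffix = module_name.rsplit(".", 1)[-1]
    if suffix in NVFP4_SUFFIXES:
        return "NVFP4 W4A4"
    if suffix in FP8_SUFFIXES:
        return "FP8 W8A8"
    # Everything else (lm_head, embeddings, in_proj_a/b, norms,
    # visual encoder, MTP sidecar) stays in BF16.
    return "BF16"
```

Note the asymmetry this encodes: DeltaNet's out_proj stays FP8, while the softmax attention o_proj drops to NVFP4.
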

Architecture match with the BF16 source:

  • model_type=qwen3_5
  • 64 text layers (hybrid DeltaNet + softmax, full_attention_interval=4)
  • mtp_num_hidden_layers=1
  • max_position_embeddings=262144
  • hidden_size=5120, intermediate_size=17408
  • vocab_size=248320

Usage

vLLM

pip install -U "vllm>=0.17.0" "transformers>=5.3.0"

Standard serving:

vllm serve mconcat/Qwopus3.5-27B-v3-NVFP4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3

With MTP speculative decoding:

vllm serve mconcat/Qwopus3.5-27B-v3-NVFP4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'

On Blackwell GPUs, add the CUTLASS backend:

VLLM_NVFP4_GEMM_BACKEND=cutlass vllm serve mconcat/Qwopus3.5-27B-v3-NVFP4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
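The --speculative-config value is a JSON object, which is easy to mangle with shell quoting. When building the serve command from a script, generating it with the standard library avoids that (the keys are exactly those shown above):

```python
import json
import shlex

# Build the speculative-decoding config programmatically instead of
# hand-writing quoted JSON on the command line.
spec = {"method": "mtp", "num_speculative_tokens": 1}
arg = json.dumps(spec)
print(f"--speculative-config {shlex.quote(arg)}")

# Round-trip check: what vLLM will parse is exactly what we meant.
assert json.loads(arg) == spec
```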

Transformers

from transformers import AutoTokenizer, Qwen3_5ForConditionalGeneration
import torch

model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "mconcat/Qwopus3.5-27B-v3-NVFP4",
    dtype=torch.bfloat16,  # torch_dtype was renamed to dtype in recent transformers
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(
    "mconcat/Qwopus3.5-27B-v3-NVFP4",
    trust_remote_code=True,
)

Compatibility

| Framework | Supported | Notes |
| --- | --- | --- |
| vLLM >= 0.17.0 | Yes | Verified with vllm==0.17.1; MTP works; Blackwell requires the CUTLASS backend |
| transformers >= 5.3.0 | Yes | Direct loading with device_map="auto" |
| SGLang | No | compressed-tensors NVFP4 not supported |

Notes

  • This is the smallest quantized variant (~24 GB) and fits comfortably on a single 32 GB GPU in eager mode.
  • MTP weights are embedded in the main model.safetensors file (no separate model.mtp.safetensors).
  • The model includes a vision encoder (loaded but unused for text-only inference). Use --skip-mm-profiling with vLLM to skip vision encoder profiling.
  • KV cache: Do not use --kv-cache-dtype fp8_e4m3 with this model family — the checkpoint lacks calibrated KV scales and will produce degraded output. Use the default BF16 KV cache.