Qwen3.5-27B-NVFP4-Opus-GB10

NVFP4 (W4A4) quantization of Qwen/Qwen3.5-27B, calibrated on reasoning datasets and optimized for serving on NVIDIA DGX Spark (GB10, Blackwell SM121).

~18 GB quantized — fits comfortably in GB10's 128 GB unified memory with room for 64k context and multi-user batching.

Quantization Details

Parameter Value
Method NVFP4 (W4A4 FP4)
Format nvfp4-pack-quantized
Library llmcompressor 0.10.0+ / compressed-tensors 0.14.0+
Group size 16
Weight bits 4-bit float
Activation bits 4-bit float (dynamic per-group at inference)
Scale dtype float8_e4m3fn
Weight observer memoryless_minmax
Activation calibration static_minmax (512 samples)
Symmetric Yes

Layers kept in BF16

  • lm_head — output projection (quantizing is optional via QUANTIZE_LM_HEAD=1)
  • linear_attn.conv1d — small 4D convolution kernels in linear attention layers
  • linear_attn.in_proj_ba — small projection layers in linear attention

These layers are excluded because FP4 kernel dispatch overhead exceeds the BF16 compute cost for their shapes.

Calibration

Calibrated on 512 samples (max sequence length 2048, seed 42) from three reasoning datasets:

Architecture

Qwen3.5-27B is a hybrid attention model alternating linear and full attention layers.

Parameter Value
Hidden size 5120
FFN intermediate size 17408
Attention heads 24
KV heads 4 (GQA)
Head dimension 256
Layers 64
Attention pattern 3x linear + 1x full (repeating)
Max position embeddings 262,144
Vocab size 248,320

How to Serve (vLLM on DGX Spark)

docker run -d \
  --gpus all --ipc=host --network host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_NVFP4_GEMM_BACKEND=cutlass \
  -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  nvllm:gb10 \
  serve \
  --model natfii/Qwen3.5-27B-NVFP4-Opus-GB10 \
  --served-model-name default \
  --kv-cache-dtype auto \
  --attention-backend triton_attn \
  --max-model-len 65536 \
  --max-num-seqs 4 \
  --language-model-only \
  --enable-prefix-caching \
  --mamba-cache-mode align \
  --mamba-block-size 64 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --gpu-memory-utilization 0.80

Requires vLLM >= 0.19.0 with NVFP4 support (Blackwell SM120+).

Testing done on a WIP fork of vLLM for GB10.

Hardware

  • Target: NVIDIA DGX Spark (GB10, Blackwell SM121)
  • Memory: 128 GB unified, 221 GB/s bandwidth
  • Benchmarks: Coming soon

Notes

  • The quantization script includes compatibility shims for transformers 5.x (TORCH_INIT_FUNCTIONS and _get_no_split_modules patches).
  • Qwen3.5's hybrid linear attention layers require --language-model-only, --mamba-cache-mode align, and --mamba-block-size 64 flags in vLLM. vLLM reuses its Mamba cache infrastructure to handle the linear attention state.
  • To also quantize lm_head to FP4, re-run quantization with QUANTIZE_LM_HEAD=1.
Downloads last month
144
Safetensors
Model size
16B params
Tensor type
F32
·
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for natfii/Qwen3.5-27B-NVFP4-Opus-GB10

Base model

Qwen/Qwen3.5-27B
Quantized
(163)
this model

Datasets used to train natfii/Qwen3.5-27B-NVFP4-Opus-GB10