---
language:
- en
- zh
license: apache-2.0
base_model: Jackrong/Qwopus3.5-27B-v3
tags:
- qwen3.5
- reasoning
- quantized
- fp8
- nvfp4
- mixed-precision
- compressed-tensors
- deltanet
- chain-of-thought
- mtp
pipeline_tag: text-generation
library_name: transformers
model_name: Qwopus3.5-27B-v3-NVFP4
quantized_by: ShinePixelOrg
---

# Qwopus3.5-27B-v3-NVFP4

Mixed-precision quantized version of [ShinePixelOrg/Qwopus3.5-27B-v3](https://huggingface.co/ShinePixelOrg/Qwopus3.5-27B-v3).

This checkpoint keeps the same hybrid Qwen3.5 DeltaNet + softmax architecture and Qwen3.5 MTP head as the BF16 source, but applies the NVFP4/FP8/BF16 mixed-precision recipe that has worked well, detailed under Quantization Strategy below.

The published folder includes:

- `model.safetensors`
- `config.json`
- `recipe.yaml`
- `tokenizer.json`
- `tokenizer_config.json`
- `processor_config.json`
- `preprocessor_config.json`
- `video_preprocessor_config.json`
- `generation_config.json`
- `chat_template.jinja`

## Verified Inference

Local inference was verified on a single **NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB)** with:

- `vllm==0.17.1`
- `transformers==5.3.0`

Patch note:

- The one-line `vllm` patch for the Blackwell/TMA issue from v1 may still be required if you run into the same problem.
- If your local `vllm` build does not already include that fix, apply the one-line patch.
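If you are unsure whether your build needs the patch, the target file can be inspected before editing it. A minimal sketch, assuming the same file path that the sed command targets; the helpers `fla_utils_path` and `needs_patch` are hypothetical names introduced here for illustration:

```python
import os


def fla_utils_path(vllm_package_dir: str) -> str:
    """Path of the FLA utils module inside an installed vllm tree
    (the file the one-line patch rewrites)."""
    return os.path.join(vllm_package_dir, "model_executor/layers/fla/ops/utils.py")


def needs_patch(source: str) -> bool:
    """True if the source still carries the unbounded `>= 9` capability
    check that misfires on Blackwell-class GPUs."""
    return "torch.cuda.get_device_capability(0)[0] >= 9" in source


# Usage against a real install (requires vllm to be importable):
#   import os, vllm
#   path = fla_utils_path(os.path.dirname(vllm.__file__))
#   print(needs_patch(open(path).read()))
```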
Concrete patch command:

```bash
UTILS_FILE=$(python -c "import vllm, os; print(os.path.join(os.path.dirname(vllm.__file__), 'model_executor/layers/fla/ops/utils.py'))") && \
sed -i 's/is_nvidia and torch.cuda.get_device_capability(0)\[0\] >= 9/is_nvidia and 9 <= torch.cuda.get_device_capability(0)[0] < 12/' "$UTILS_FILE"
```

The exact validated command for MTP-enabled serving was:

```bash
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 \
vllm serve ShinePixelOrg/Qwopus3.5-27B-v3-NVFP4 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8_e4m3 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
```

The same model also serves without MTP:

```bash
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 \
vllm serve ShinePixelOrg/Qwopus3.5-27B-v3-NVFP4 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8_e4m3 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3
```

What was verified in that run:

- the server started cleanly
- `GET /health` returned `200`
- `GET /v1/models` returned the model
- `POST /v1/chat/completions` returned `200`
- MTP/speculative decoding was active and reported acceptance metrics in the server logs

## Quantization Strategy

Non-uniform mixed-precision quantization using [llm-compressor](https://github.com/vllm-project/llm-compressor):

| Precision | Layers |
|-----------|--------|
| **FP8 W8A8** | DeltaNet `in_proj_qkv`, `in_proj_z`, `out_proj`; softmax `q_proj`/`k_proj`/`v_proj`; MLP `down_proj` |
| **NVFP4 W4A4** | softmax `o_proj`; MLP `gate_proj`/`up_proj` |
| **BF16** | `lm_head`, `embed_tokens`, DeltaNet `in_proj_a`/`in_proj_b`, norms, visual encoder, MTP sidecar |

Architecture match with the BF16 source:

- `model_type=qwen3_5`
- `64` text layers
- `full_attention_interval=4`
- `mtp_num_hidden_layers=1`
- `max_position_embeddings=262144`

## Usage

### vLLM

```bash
pip install -U vllm transformers
```

With MTP:

```bash
vllm serve ShinePixelOrg/Qwopus3.5-27B-v3-NVFP4 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8_e4m3 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
```

Without MTP:

```bash
vllm serve ShinePixelOrg/Qwopus3.5-27B-v3-NVFP4 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8_e4m3 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3
```

### Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "ShinePixelOrg/Qwopus3.5-27B-v3-NVFP4",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "ShinePixelOrg/Qwopus3.5-27B-v3-NVFP4",
    trust_remote_code=True,
)
```

## Compatibility

| Framework | Supported | Notes |
|-----------|-----------|-------|
| vLLM >= 0.17.0 | Yes | Verified locally with `vllm==0.17.1` |
| transformers >= 5.3.0 | Yes | Direct loading works with `device_map="auto"` |
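Once the server from one of the `vllm serve` commands above is up, it exposes the OpenAI-compatible API whose endpoints were verified (`/health`, `/v1/models`, `/v1/chat/completions`). A minimal client sketch using only the standard library; `build_chat_request` and `post_chat` are hypothetical helpers, and port 8000 assumes vLLM's default:

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str, max_tokens: int = 512) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def post_chat(base_url: str, payload: dict) -> dict:
    """POST the payload to a running vLLM server and parse the JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Against a live server (vLLM listens on port 8000 unless --port is set):
#   body = post_chat(
#       "http://localhost:8000",
#       build_chat_request("ShinePixelOrg/Qwopus3.5-27B-v3-NVFP4",
#                          "Explain NVFP4 quantization in one paragraph."),
#   )
#   print(body["choices"][0]["message"]["content"])
```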