---
language:
- en
- zh
license: apache-2.0
base_model: Jackrong/Qwopus3.5-27B-v3
tags:
- qwen3.5
- reasoning
- quantized
- fp8
- nvfp4
- mixed-precision
- compressed-tensors
- deltanet
- chain-of-thought
- mtp
pipeline_tag: text-generation
library_name: transformers
model_name: Qwopus3.5-27B-v3-NVFP4
quantized_by: ShinePixelOrg
---

# Qwopus3.5-27B-v3-NVFP4

Mixed-precision quantized version of [ShinePixelOrg/Qwopus3.5-27B-v3](https://huggingface.co/ShinePixelOrg/Qwopus3.5-27B-v3).

This checkpoint keeps the same hybrid Qwen3.5 DeltaNet + softmax architecture and Qwen3.5 MTP head as the BF16 source, but applies the NVFP4/FP8/BF16 mixed-precision recipe that has worked well, detailed under Quantization Strategy below.

The published folder includes:

- `model.safetensors`
- `config.json`
- `recipe.yaml`
- `tokenizer.json`
- `tokenizer_config.json`
- `processor_config.json`
- `preprocessor_config.json`
- `video_preprocessor_config.json`
- `generation_config.json`
- `chat_template.jinja`

## Verified Inference

Local inference was verified on a single **NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB)** with:

- `vllm==0.17.1`
- `transformers==5.3.0`

Patch note:

- The one-line `vllm` patch for the Blackwell/TMA issue from v1 may still be required if you run into the same problem.
- If your local `vllm` build does not already include that fix, apply the one-line patch.
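If you are unsure whether your build needs the patch, the target file can be inspected before editing it. A minimal sketch, assuming the same file path that the sed command targets; the helpers `fla_utils_path` and `needs_patch` are hypothetical names introduced here for illustration:

```python
import os


def fla_utils_path(vllm_package_dir: str) -> str:
    """Path of the FLA utils module inside an installed vllm tree
    (the file the one-line patch rewrites)."""
    return os.path.join(vllm_package_dir, "model_executor/layers/fla/ops/utils.py")


def needs_patch(source: str) -> bool:
    """True if the source still carries the unbounded `>= 9` capability
    check that misfires on Blackwell-class GPUs."""
    return "torch.cuda.get_device_capability(0)[0] >= 9" in source


# Usage against a real install (requires vllm to be importable):
#   import os, vllm
#   path = fla_utils_path(os.path.dirname(vllm.__file__))
#   print(needs_patch(open(path).read()))
```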
Concrete patch command:

```bash
UTILS_FILE=$(python -c "import vllm, os; print(os.path.join(os.path.dirname(vllm.__file__), 'model_executor/layers/fla/ops/utils.py'))") && \
sed -i 's/is_nvidia and torch.cuda.get_device_capability(0)\[0\] >= 9/is_nvidia and 9 <= torch.cuda.get_device_capability(0)[0] < 12/' "$UTILS_FILE"
```

The exact validated command for MTP-enabled serving was:

```bash
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 \
vllm serve ShinePixelOrg/Qwopus3.5-27B-v3-NVFP4 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8_e4m3 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
```

The same model also serves without MTP:

```bash
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 \
vllm serve ShinePixelOrg/Qwopus3.5-27B-v3-NVFP4 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8_e4m3 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3
```

What was verified in that run:

- the server started cleanly
- `GET /health` returned `200`
- `GET /v1/models` returned the model
- `POST /v1/chat/completions` returned `200`
- MTP/speculative decoding was active and reported acceptance metrics in the server logs

## Quantization Strategy

Non-uniform mixed-precision quantization using [llm-compressor](https://github.com/vllm-project/llm-compressor):

| Precision | Layers |
|-----------|--------|
| **FP8 W8A8** | DeltaNet `in_proj_qkv`, `in_proj_z`, `out_proj`; softmax `q_proj`/`k_proj`/`v_proj`; MLP `down_proj` |
| **NVFP4 W4A4** | softmax `o_proj`; MLP `gate_proj`/`up_proj` |
| **BF16** | `lm_head`, `embed_tokens`, DeltaNet `in_proj_a`/`in_proj_b`, norms, visual encoder, MTP sidecar |

Architecture match with the BF16 source:

- `model_type=qwen3_5`
- `64` text layers
- `full_attention_interval=4`
- `mtp_num_hidden_layers=1`
- `max_position_embeddings=262144`

## Usage

### vLLM

```bash
pip install -U vllm transformers
```

With MTP:

```bash
vllm serve ShinePixelOrg/Qwopus3.5-27B-v3-NVFP4 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8_e4m3 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
```

Without MTP:

```bash
vllm serve ShinePixelOrg/Qwopus3.5-27B-v3-NVFP4 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8_e4m3 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3
```

### Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "ShinePixelOrg/Qwopus3.5-27B-v3-NVFP4",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "ShinePixelOrg/Qwopus3.5-27B-v3-NVFP4",
    trust_remote_code=True,
)
```

## Compatibility

| Framework | Supported | Notes |
|-----------|-----------|-------|
| vLLM >= 0.17.0 | Yes | Verified locally with `vllm==0.17.1` |
| transformers >= 5.3.0 | Yes | Direct loading works with `device_map="auto"` |
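Once the server from one of the `vllm serve` commands above is up, it exposes the OpenAI-compatible API whose endpoints were verified (`/health`, `/v1/models`, `/v1/chat/completions`). A minimal client sketch using only the standard library; `build_chat_request` and `post_chat` are hypothetical helpers, and port 8000 assumes vLLM's default:

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str, max_tokens: int = 512) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def post_chat(base_url: str, payload: dict) -> dict:
    """POST the payload to a running vLLM server and parse the JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Against a live server (vLLM listens on port 8000 unless --port is set):
#   body = post_chat(
#       "http://localhost:8000",
#       build_chat_request("ShinePixelOrg/Qwopus3.5-27B-v3-NVFP4",
#                          "Explain NVFP4 quantization in one paragraph."),
#   )
#   print(body["choices"][0]["message"]["content"])
```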