---
license: mit
base_model: Qwen/Qwen3.5-9B
tags:
- hipfire
- amd
- rdna
- quantized
- qwen3.5
library_name: hipfire
---

# Qwen3.5-9B for hipfire

Pre-quantized **Qwen3.5-9B** (DeltaNet hybrid) for [hipfire](https://github.com/Kaden-Schutt/hipfire), a Rust-native LLM inference engine for AMD RDNA GPUs. Quantized from [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B).

## Files

| File | Quant | Size | Min VRAM | RX 5700 XT | RX 7900 XTX |
|------|-------|------|----------|------------|-------------|
| qwen3.5-9b.hf4 | HF4 | 4.44 GB | 6 GB | 45 tok/s | 138 tok/s |
| qwen3.5-9b.hf6 | HF6 | 6.79 GB | 8 GB | 37 tok/s | — |
| qwen3.5-9b.mq4 | MQ4 ⭐ | 4.95 GB | 6 GB | TBD | 135 tok/s |

Speeds are forward-only (decode) tok/s on the listed AMD GPU. ⭐ MQ4 ships with a mandatory byte-exact greedy quality gate (9 reference token streams).

## Usage

```bash
# Install hipfire
curl -L https://raw.githubusercontent.com/Kaden-Schutt/hipfire/master/scripts/install.sh | bash

# Pull and run any variant
hipfire pull qwen3.5:9b        # HF4 (default — fastest)
hipfire pull qwen3.5:9b-mq4    # MQ4 (quality-gated, near-Q8 output)
hipfire pull qwen3.5:9b-hf6    # HF6 (highest quality, ~15% slower)

hipfire run qwen3.5:9b-mq4 "Hello"
```

## Quantization Formats

- **HF4 (HFQ4-G256)** — flat 4-bit, 256-weight groups (~0.53 B/w including per-group scale + zero). Best raw tok/s. Same storage layout as Q4_K_M in llama.cpp but without the K-quant block descriptors.
- **HF6 (HFQ6-G256)** — flat 6-bit, 256-weight groups (~0.78 B/w). Highest quality, ~15% slower than HF4. Use this if you have VRAM headroom and want the smallest accuracy loss vs FP16.
- **MQ4 (MagnumQuant 4-bit)** ⭐ — FWHT-rotated 4-bit. Storage layout identical to HF4 (4.25 bits/weight, ~0.53 B/w), but the weights are pre-rotated through a Walsh–Hadamard transform at quantization time, and the input `x` vector is rotated through the same transform on the fly during the GEMV. The rotation flattens outliers, dramatically improving the quantization-error distribution.
Result: **roughly Q8-grade output quality at Q4 bandwidth**. Every commit that touches kernel or forward-pass code in the hipfire repo is gated against MQ4 byte-exact greedy decoding of 9 reference (model, prompt) pairs — see [tests/quality-baselines](https://github.com/Kaden-Schutt/hipfire/tree/master/tests/quality-baselines) and `scripts/quality-gate.sh`. Any silent numerical regression in the forward pass is caught at commit time.

All formats embed the tokenizer and model config inside the model file — no separate `tokenizer.json` download needed.

## About hipfire

Rust + HIP inference engine for AMD consumer GPUs (RDNA1–RDNA4). No Python in the hot path. The 0.1.4-alpha branch lands a kernel-fusion overhaul that roughly doubles forward speed on Qwen3.5 across the lineup vs the previous release.

- GitHub: [Kaden-Schutt/hipfire](https://github.com/Kaden-Schutt/hipfire)
- All models: [docs/MODELS.md](https://github.com/Kaden-Schutt/hipfire/blob/master/docs/MODELS.md)

## License

Model weights subject to the original [Qwen license](https://huggingface.co/Qwen/Qwen3.5-9B). hipfire engine: MIT.