---
license: mit
base_model: Qwen/Qwen3.5-9B
tags:
- hipfire
- amd
- rdna
- quantized
- qwen3.5
library_name: hipfire
---

# Qwen3.5-9B for hipfire

Pre-quantized **Qwen3.5-9B** (DeltaNet hybrid) for [hipfire](https://github.com/Kaden-Schutt/hipfire), a Rust-native LLM inference engine for AMD RDNA GPUs. Quantized from [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B).

## Files

| File | Quant | Size | Min VRAM | RX 5700 XT | RX 7900 XTX |
|------|-------|------|----------|------------|-------------|
| qwen3.5-9b.hf4 | HF4 | 4.44 GB | 6 GB | 45 tok/s | 138 tok/s |
| qwen3.5-9b.hf6 | HF6 | 6.79 GB | 8 GB | 37 tok/s | — |
| qwen3.5-9b.mq4 | MQ4 ⭐ | 4.95 GB | 6 GB | TBD | 135 tok/s |

Speeds are forward-only (decode) tok/s on the listed AMD GPU. ⭐ MQ4 ships with a mandatory byte-exact greedy quality gate (9 reference token streams).

## Usage

```bash
# Install hipfire
curl -L https://raw.githubusercontent.com/Kaden-Schutt/hipfire/master/scripts/install.sh | bash

# Pull and run any variant
hipfire pull qwen3.5:9b        # HF4 (default — fastest)
hipfire pull qwen3.5:9b-mq4    # MQ4 (quality-gated, near-Q8 output)
hipfire pull qwen3.5:9b-hf6    # HF6 (highest quality, ~15% slower)

hipfire run qwen3.5:9b-mq4 "Hello"
```

## Quantization Formats

- **HF4 (HFQ4-G256)** — flat 4-bit, 256-weight groups (~0.53 B/w including per-group scale + zero). Best raw tok/s. Same storage layout as Q4_K_M in llama.cpp but without the K-quant block descriptors.
- **HF6 (HFQ6-G256)** — flat 6-bit, 256-weight groups (~0.78 B/w). Highest quality, ~15% slower than HF4. Use this if you have VRAM headroom and want the smallest accuracy loss vs FP16.
- **MQ4 (MagnumQuant 4-bit)** ⭐ — FWHT-rotated 4-bit. Storage layout identical to HF4 (4.25 bits/weight, ~0.53 B/w), but the weights are pre-rotated through a Walsh–Hadamard transform at quantization time, and the input `x` vector is rotated through the same transform on the fly during the GEMV. The rotation flattens outliers, dramatically improving the quantization-error distribution.
Result: **roughly Q8-grade output quality at Q4 bandwidth**. Every commit that touches kernel or forward-pass code in the hipfire repo is gated against MQ4 byte-exact greedy decoding of 9 reference (model, prompt) pairs — see [tests/quality-baselines](https://github.com/Kaden-Schutt/hipfire/tree/master/tests/quality-baselines) and `scripts/quality-gate.sh`. Any silent numerical regression in the forward pass is caught at commit time.

All formats embed the tokenizer and model config inside the model file — no separate `tokenizer.json` download needed.

## About hipfire

Rust + HIP inference engine for AMD consumer GPUs (RDNA1–RDNA4). No Python in the hot path. The 0.1.4-alpha branch lands a kernel-fusion overhaul that roughly doubles forward speed on Qwen3.5 across the lineup vs the previous release.

- GitHub: [Kaden-Schutt/hipfire](https://github.com/Kaden-Schutt/hipfire)
- All models: [docs/MODELS.md](https://github.com/Kaden-Schutt/hipfire/blob/master/docs/MODELS.md)

## License

Model weights subject to the original [Qwen license](https://huggingface.co/Qwen/Qwen3.5-9B). hipfire engine: MIT.