---
tags:
- techwithsergiu
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3.5-4B/blob/main/LICENSE
pipeline_tag: text-generation
base_model:
- techwithsergiu/Qwen3.5-text-4B
---

# Qwen3.5-text-4B-bnb-4bit

BNB NF4 4-bit quantization of [techwithsergiu/Qwen3.5-text-4B](https://huggingface.co/techwithsergiu/Qwen3.5-text-4B) — a text-only derivative of [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B).

**No visual tower** — text input only. This is the recommended base for Unsloth LoRA text fine-tuning: smaller VRAM footprint, no visual-dependency complexity, cleaner adapter targeting.

Inference has been verified. LoRA fine-tuning docs are pending — see the Fine-tuning section below.

## What was changed from the original Qwen3.5-4B

- Visual tower removed (same as `Qwen3.5-text-4B`)
- Text backbone quantized to BNB NF4 double-quant (`bnb_4bit_quant_type=nf4`, `bnb_4bit_compute_dtype=bfloat16`)
- `lm_head.weight` kept at **bf16** for output quality / stability

## Model family

![](diagrams/diagram_01.png)

| Model | Type | Base model |
|---|---|---|
| [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) | f16 · VLM · source | — |
| [techwithsergiu/Qwen3.5-4B-bnb-4bit](https://huggingface.co/techwithsergiu/Qwen3.5-4B-bnb-4bit) | BNB NF4 · VLM | Qwen/Qwen3.5-4B |
| [techwithsergiu/Qwen3.5-text-4B](https://huggingface.co/techwithsergiu/Qwen3.5-text-4B) | bf16 · text-only | Qwen/Qwen3.5-4B |
| **[techwithsergiu/Qwen3.5-text-4B-bnb-4bit](https://huggingface.co/techwithsergiu/Qwen3.5-text-4B-bnb-4bit)** | BNB NF4 · text-only | Qwen3.5-text-4B |
| [techwithsergiu/Qwen3.5-text-4B-GGUF](https://huggingface.co/techwithsergiu/Qwen3.5-text-4B-GGUF) | GGUF quants | Qwen3.5-text-4B |

The visual tower scales with model size (~0.19 GB for 0.8B, ~0.62 GB for 2B/4B, ~0.85 GB for 9B). BNB text-only models are roughly 34% of the original f16 size (4B example: 9.32 GB → 3.12 GB).
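The quantization settings above correspond to a `quantization_config` entry in the model's `config.json`. A minimal sketch of what that entry looks like, assuming the standard bitsandbytes field names; the `llm_int8_skip_modules` line is our assumption, mirroring the bf16 `lm_head` described above, and the exact values shipped on the Hub may differ:

```python
# Sketch of the bitsandbytes quantization_config (illustrative, not copied
# from the repo). Field names follow the transformers/bitsandbytes convention.
quantization_config = {
    "quant_method": "bitsandbytes",
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",            # NF4 data type
    "bnb_4bit_use_double_quant": True,       # double quantization
    "bnb_4bit_compute_dtype": "bfloat16",    # compute in bf16
    "llm_int8_skip_modules": ["lm_head"],    # assumption: keeps the bf16 head unquantized
}

print(quantization_config["bnb_4bit_quant_type"])  # nf4
```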
## Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "techwithsergiu/Qwen3.5-text-4B-bnb-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "What is the capital of Romania?"}]

# Thinking OFF — direct answer
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(response)

# Thinking ON — chain-of-thought before the answer
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(response)
```

## Fine-tuning

This is the **primary training target** for text LoRA fine-tuning.
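With thinking enabled, Qwen-family models emit their reasoning before the answer, wrapped in `<think>...</think>` markers. A minimal sketch of a helper that separates the two, assuming those markers survive decoding; the `split_thinking` name is ours, not part of any library API:

```python
def split_thinking(response: str) -> tuple[str, str]:
    """Split a decoded response into (thinking, answer).

    Assumes Qwen-style output where reasoning appears inside
    <think>...</think> before the final answer; returns an empty
    thinking string when no markers are present.
    """
    end = response.find("</think>")
    if end == -1:
        return "", response.strip()
    thinking = response[:end].replace("<think>", "").strip()
    answer = response[end + len("</think>"):].strip()
    return thinking, answer

# Example with a synthetic response string:
thinking, answer = split_thinking(
    "<think>Romania's capital is Bucharest.</think>The capital of Romania is Bucharest."
)
print(answer)  # The capital of Romania is Bucharest.
```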
Training pipeline: **[github.com/techwithsergiu/qwen-qlora-train](https://techwithsergiu.github.io/qwen-qlora-train)**

QLoRA (Unsloth + TRL + PEFT) · rank 16–64 · validated on RTX 3070 8 GB

```bash
# Quick start — install
pip install "unsloth[cu124-torch260] @ git+https://github.com/unslothai/unsloth.git"
pip install git+https://github.com/techwithsergiu/qwen-qlora-train.git

# Train with a ready-made config
qlora-train configs/qwen35/0.8b.yaml  # or 2b / 4b
```

After training, test the adapter without merging:

```bash
qlora-infer \
  --model techwithsergiu/Qwen3.5-text-4B-bnb-4bit \
  --adapter adapters/
```

For VLM (image + text) fine-tuning of the full model, see: [unsloth.ai/docs/models/qwen3.5/fine-tune](https://unsloth.ai/docs/models/qwen3.5/fine-tune)

## Pipeline diagram

![](diagrams/diagram_02.png)

## Conversion

Converted using [qwen35-toolkit](https://techwithsergiu.github.io/qwen35-toolkit) — a Python toolkit for BNB quantization, visual tower removal, verification, and HF Hub publishing of Qwen3.5 models.

---

## Acknowledgements

Based on [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) by the Qwen Team. If you use this model in research, please cite the original:

```bibtex
@misc{qwen3.5,
    title  = {{Qwen3.5}: Towards Native Multimodal Agents},
    author = {{Qwen Team}},
    month  = {February},
    year   = {2026},
    url    = {https://qwen.ai/blog?id=qwen3.5}
}
```