---
tags:
- techwithsergiu
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3.5-4B/blob/main/LICENSE
pipeline_tag: text-generation
base_model:
- techwithsergiu/Qwen3.5-text-4B
---

# Qwen3.5-text-4B-bnb-4bit

<img width="400px" src="https://qianwen-res.oss-accelerate.aliyuncs.com/logo_qwen3.5.png">

BNB NF4 4-bit quantization of [techwithsergiu/Qwen3.5-text-4B](https://huggingface.co/techwithsergiu/Qwen3.5-text-4B), a text-only derivative of [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B).

**No visual tower** — text input only. This is the recommended base for Unsloth LoRA text fine-tuning: smaller VRAM footprint, no visual-dependency complexity, and cleaner adapter targeting.

Inference has been verified. LoRA fine-tuning docs are pending; see the Fine-tuning section below.

## What was changed from the original Qwen3.5-4B

- Visual tower removed (same as `Qwen3.5-text-4B`)
- Text backbone quantized to BNB NF4 with double quantization (`bnb_4bit_quant_type=nf4`, `bnb_4bit_use_double_quant=True`, `bnb_4bit_compute_dtype=bfloat16`)
- `lm_head.weight` kept at **bf16** for output quality and stability

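If you want to reproduce the quantization from the bf16 model yourself (the published checkpoint already ships with its quantization config, so none of this is needed for plain loading), the settings above map onto a `BitsAndBytesConfig` roughly like this sketch:

```python
import torch
from transformers import BitsAndBytesConfig

# Mirrors the settings listed above; llm_int8_skip_modules keeps
# lm_head out of 4-bit quantization so it stays in bf16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_skip_modules=["lm_head"],
)
# Pass as quantization_config=bnb_config to
# AutoModelForCausalLM.from_pretrained("techwithsergiu/Qwen3.5-text-4B", ...)
```

This is a configuration sketch, not the exact toolkit invocation used to produce the checkpoint.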
## Model family

![Model family](https://huggingface.co/techwithsergiu/Qwen3.5-text-4B-bnb-4bit/resolve/main/family.png)

| Model | Type | Base model |
|---|---|---|
| [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) | f16 · VLM · source | — |
| [techwithsergiu/Qwen3.5-4B-bnb-4bit](https://huggingface.co/techwithsergiu/Qwen3.5-4B-bnb-4bit) | BNB NF4 · VLM | Qwen/Qwen3.5-4B |
| [techwithsergiu/Qwen3.5-text-4B](https://huggingface.co/techwithsergiu/Qwen3.5-text-4B) | bf16 · text-only | Qwen/Qwen3.5-4B |
| **[techwithsergiu/Qwen3.5-text-4B-bnb-4bit](https://huggingface.co/techwithsergiu/Qwen3.5-text-4B-bnb-4bit)** | BNB NF4 · text-only | Qwen3.5-text-4B |
| [techwithsergiu/Qwen3.5-text-4B-GGUF](https://huggingface.co/techwithsergiu/Qwen3.5-text-4B-GGUF) | GGUF quants | Qwen3.5-text-4B |

The visual tower scales with model size (~0.19 GB for 0.8B, ~0.62 GB for 2B/4B, ~0.85 GB for 9B).
BNB text-only models are roughly 34% of the original f16 size (4B example: 9.32 GB → 3.12 GB).

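That ~34% ratio is easy to sanity-check with back-of-envelope arithmetic. The bits-per-parameter figures below are approximations (NF4 stores roughly 4.5 bits per weight once per-block absmax scales are counted; double quantization shaves a little more), and the 8% share of parameters kept in bf16 is an illustrative assumption, not a measured number:

```python
def quantized_size_ratio(kept_frac: float,
                         f16_bytes_per_param: float = 2.0,
                         nf4_bytes_per_param: float = 0.5625) -> float:
    """Estimated size of a partially quantized model relative to f16.

    kept_frac: fraction of parameters left in bf16 (e.g. lm_head).
    nf4_bytes_per_param: ~4.5 bits/weight = 0.5625 bytes, an approximation
    that includes per-block absmax scale overhead.
    """
    blended = kept_frac * f16_bytes_per_param + (1 - kept_frac) * nf4_bytes_per_param
    return blended / f16_bytes_per_param

# With ~8% of parameters assumed kept in bf16, the estimate lands near
# the observed 3.12 GB / 9.32 GB ≈ 0.335 ratio.
print(f"{quantized_size_ratio(0.08):.3f}")  # ≈ 0.339
```

The exact checkpoint ratio also depends on embeddings, non-quantized norms, and metadata, so treat this as a rough model, not an accounting.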
## Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "techwithsergiu/Qwen3.5-text-4B-bnb-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "What is the capital of Romania?"}]

# Thinking OFF — direct answer
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(response)

# Thinking ON — chain-of-thought before the answer
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(response)
```

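When thinking is enabled, it is often useful to separate the reasoning trace from the final answer. A minimal sketch, assuming the decoded output keeps Qwen-style `<think>…</think>` markup (whether the tags survive decoding depends on your `skip_special_tokens` setting, so verify against your own outputs):

```python
def split_thinking(decoded: str) -> tuple[str, str]:
    """Split a decoded completion into (thinking, answer).

    Assumes Qwen-style markup where reasoning is wrapped in
    <think>...</think> before the final answer; if no closing tag
    is present, everything is treated as the answer.
    """
    head, sep, tail = decoded.partition("</think>")
    if not sep:  # no thinking block found
        return "", decoded.strip()
    thinking = head.replace("<think>", "").strip()
    return thinking, tail.strip()

sample = "<think>Recall geography.</think>The capital of Romania is Bucharest."
thinking, answer = split_thinking(sample)
print(answer)  # → The capital of Romania is Bucharest.
```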
## Fine-tuning

This is the **primary training target** for text LoRA fine-tuning.

Training pipeline: **[github.com/techwithsergiu/qwen-qlora-train](https://techwithsergiu.github.io/qwen-qlora-train)**

QLoRA (Unsloth + TRL + PEFT) · rank 16–64 · validated on an RTX 3070 8 GB

```bash
# Quick start — install
pip install "unsloth[cu124-torch260] @ git+https://github.com/unslothai/unsloth.git"
pip install git+https://github.com/techwithsergiu/qwen-qlora-train.git

# Train with a ready-made config
qlora-train configs/qwen35/0.8b.yaml  # or 2b / 4b
```

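For the rank range above, adapter size scales linearly with rank: each adapted weight matrix of shape (d_out, d_in) gains r · (d_in + d_out) trainable parameters. A sketch with illustrative dimensions (the layer count and shapes below are hypothetical, not the actual Qwen3.5-4B geometry):

```python
def lora_param_count(rank: int, shapes: list[tuple[int, int]]) -> int:
    """Trainable LoRA parameters for a set of (d_out, d_in) target matrices.

    Each target W gets two low-rank factors: A (rank x d_in) and
    B (d_out x rank), i.e. rank * (d_in + d_out) parameters per matrix.
    """
    return sum(rank * (d_in + d_out) for d_out, d_in in shapes)

# Hypothetical targets: four square attention projections per layer
# in a 2560-wide, 36-layer model (illustrative only; real GQA k/v
# projections are narrower).
shapes = [(2560, 2560)] * 4 * 36
r16 = lora_param_count(16, shapes)
r64 = lora_param_count(64, shapes)
print(r16, r64, r64 // r16)  # rank 64 is exactly 4x rank 16
```

Even at rank 64 the adapter stays in the tens of millions of parameters, which is why the 4-bit base plus LoRA fits on an 8 GB card.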
After training, test the adapter without merging:

```bash
qlora-infer \
  --model techwithsergiu/Qwen3.5-text-4B-bnb-4bit \
  --adapter adapters/<run_name>
```

For VLM (image + text) fine-tuning of the full model, see:
[unsloth.ai/docs/models/qwen3.5/fine-tune](https://unsloth.ai/docs/models/qwen3.5/fine-tune)

## Pipeline diagram

![Pipeline diagram](https://huggingface.co/techwithsergiu/Qwen3.5-text-4B-bnb-4bit/resolve/main/pipeline.png)

## Conversion

Converted with [qwen35-toolkit](https://techwithsergiu.github.io/qwen35-toolkit), a Python toolkit for BNB quantization, visual tower removal, verification, and HF Hub publishing of Qwen3.5 models.

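Conceptually, the visual-tower removal step boils down to dropping every parameter under the vision branch's key prefix and keeping the rest. A toy sketch (the `visual.` prefix and the key names below are hypothetical, not the checkpoint's actual layout):

```python
def strip_visual_tower(state_dict: dict, prefix: str = "visual.") -> dict:
    """Return a copy of state_dict with all vision-branch tensors removed."""
    return {k: v for k, v in state_dict.items() if not k.startswith(prefix)}

# Toy state dict with hypothetical key names.
toy = {
    "model.layers.0.self_attn.q_proj.weight": "text tensor",
    "visual.patch_embed.proj.weight": "vision tensor",
    "lm_head.weight": "text tensor",
}
text_only = strip_visual_tower(toy)
print(sorted(text_only))  # the visual.* entry is gone
```

A real conversion also has to update the model config and any weight-tying so the text-only checkpoint loads cleanly; the toolkit handles that end to end.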
---

## Acknowledgements

Based on [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) by the Qwen Team. If you use this model in research, please cite the original:

```bibtex
@misc{qwen3.5,
  title  = {{Qwen3.5}: Towards Native Multimodal Agents},
  author = {{Qwen Team}},
  month  = {February},
  year   = {2026},
  url    = {https://qwen.ai/blog?id=qwen3.5}
}
```