---
tags:
- techwithsergiu
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3.5-4B/blob/main/LICENSE
pipeline_tag: text-generation
base_model:
- techwithsergiu/Qwen3.5-text-4B
---

# Qwen3.5-text-4B-bnb-4bit

BNB NF4 4-bit quantization of [techwithsergiu/Qwen3.5-text-4B](https://huggingface.co/techwithsergiu/Qwen3.5-text-4B) — a text-only derivative of [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B).

**No visual tower** — text input only. This is the recommended base for Unsloth LoRA text fine-tuning: smaller VRAM footprint, no visual-dependency complexity, cleaner adapter targeting.

Inference has been verified. LoRA fine-tuning docs are pending — see the Fine-tuning section below.

## What was changed from the original Qwen3.5-4B

- Visual tower removed (same as `Qwen3.5-text-4B`)
- Text backbone quantized to BNB NF4 double-quant (`bnb_4bit_quant_type=nf4`, `bnb_4bit_compute_dtype=bfloat16`)
- `lm_head.weight` kept at **bf16** for output quality / stability

## Model family

![](diagrams/diagram_01.png)

| Model | Type | Base model |
|---|---|---|
| [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) | f16 · VLM · source | — |
| [techwithsergiu/Qwen3.5-4B-bnb-4bit](https://huggingface.co/techwithsergiu/Qwen3.5-4B-bnb-4bit) | BNB NF4 · VLM | Qwen/Qwen3.5-4B |
| [techwithsergiu/Qwen3.5-text-4B](https://huggingface.co/techwithsergiu/Qwen3.5-text-4B) | bf16 · text-only | Qwen/Qwen3.5-4B |
| **[techwithsergiu/Qwen3.5-text-4B-bnb-4bit](https://huggingface.co/techwithsergiu/Qwen3.5-text-4B-bnb-4bit)** | BNB NF4 · text-only | Qwen3.5-text-4B |
| [techwithsergiu/Qwen3.5-text-4B-GGUF](https://huggingface.co/techwithsergiu/Qwen3.5-text-4B-GGUF) | GGUF quants | Qwen3.5-text-4B |

The visual tower scales with model size (~0.19 GB for 0.8B, ~0.62 GB for 2B/4B, ~0.85 GB for 9B). BNB text-only models are roughly 34% of the original f16 size (4B example: 9.32 GB → 3.12 GB).
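The quantization settings above correspond to a `quantization_config` entry in the model's `config.json`. A minimal sketch of what that entry looks like, assuming the standard bitsandbytes field names; the `llm_int8_skip_modules` line is our assumption, mirroring the bf16 `lm_head` described above, and the exact values shipped on the Hub may differ:

```python
# Sketch of the bitsandbytes quantization_config (illustrative, not copied
# from the repo). Field names follow the transformers/bitsandbytes convention.
quantization_config = {
    "quant_method": "bitsandbytes",
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",            # NF4 data type
    "bnb_4bit_use_double_quant": True,       # double quantization
    "bnb_4bit_compute_dtype": "bfloat16",    # compute in bf16
    "llm_int8_skip_modules": ["lm_head"],    # assumption: keeps the bf16 head unquantized
}

print(quantization_config["bnb_4bit_quant_type"])  # nf4
```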
## Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "techwithsergiu/Qwen3.5-text-4B-bnb-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "What is the capital of Romania?"}]

# Thinking OFF — direct answer
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(response)

# Thinking ON — chain-of-thought before the answer
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(response)
```

## Fine-tuning

This is the **primary training target** for text LoRA fine-tuning.
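With thinking enabled, Qwen-family models emit their reasoning before the answer, wrapped in `<think>...</think>` markers. A minimal sketch of a helper that separates the two, assuming those markers survive decoding; the `split_thinking` name is ours, not part of any library API:

```python
def split_thinking(response: str) -> tuple[str, str]:
    """Split a decoded response into (thinking, answer).

    Assumes Qwen-style output where reasoning appears inside
    <think>...</think> before the final answer; returns an empty
    thinking string when no markers are present.
    """
    end = response.find("</think>")
    if end == -1:
        return "", response.strip()
    thinking = response[:end].replace("<think>", "").strip()
    answer = response[end + len("</think>"):].strip()
    return thinking, answer

# Example with a synthetic response string:
thinking, answer = split_thinking(
    "<think>Romania's capital is Bucharest.</think>The capital of Romania is Bucharest."
)
print(answer)  # The capital of Romania is Bucharest.
```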
Training pipeline: **[github.com/techwithsergiu/qwen-qlora-train](https://techwithsergiu.github.io/qwen-qlora-train)**

QLoRA (Unsloth + TRL + PEFT) · rank 16–64 · validated on RTX 3070 8 GB

```bash
# Quick start — install
pip install "unsloth[cu124-torch260] @ git+https://github.com/unslothai/unsloth.git"
pip install git+https://github.com/techwithsergiu/qwen-qlora-train.git

# Train with a ready-made config
qlora-train configs/qwen35/0.8b.yaml  # or 2b / 4b
```

After training, test the adapter without merging:

```bash
qlora-infer \
  --model techwithsergiu/Qwen3.5-text-4B-bnb-4bit \
  --adapter adapters/
```

For VLM (image + text) fine-tuning of the full model, see: [unsloth.ai/docs/models/qwen3.5/fine-tune](https://unsloth.ai/docs/models/qwen3.5/fine-tune)

## Pipeline diagram

![](diagrams/diagram_02.png)

## Conversion

Converted using [qwen35-toolkit](https://techwithsergiu.github.io/qwen35-toolkit) — a Python toolkit for BNB quantization, visual tower removal, verification, and HF Hub publishing of Qwen3.5 models.

---

## Acknowledgements

Based on [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) by the Qwen Team. If you use this model in research, please cite the original:

```bibtex
@misc{qwen3.5,
    title  = {{Qwen3.5}: Towards Native Multimodal Agents},
    author = {{Qwen Team}},
    month  = {February},
    year   = {2026},
    url    = {https://qwen.ai/blog?id=qwen3.5}
}
```