Vision encoder quantization error - model crashes on image inputs

#1
by tsetuser32323 - opened

Hi, the vision encoder layers are incorrectly quantized, which causes this error during inference with images:

AssertionError: module.weight.shape[1] == 1
File "bitsandbytes/nn/modules.py", line 415, in fix_4bit_weight_quant_state_from_module

Issue: the vision encoder (.visual.*) shouldn't be quantized with NF4; only the text layers should be.

Workaround: Use Qwen/Qwen3.5-4B with --quantization bnb-4bit flag instead, or try QuantTrio/Qwen3.5-4B-AWQ.
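For anyone quantizing such a model themselves, a hypothetical config sketch of the fix being asked for here: keep the vision tower (and lm_head) at full precision via `llm_int8_skip_modules`. The module names ("model.visual", "lm_head") are taken from this thread's layout; adjust them to your model.

```python
# Config-fragment sketch (not the exact setup used in this repo):
# exclude the vision tower from NF4 quantization when loading.
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
    # Despite the int8-flavored name, this list is also honored for
    # 4-bit loading: matching modules are kept at full precision.
    llm_int8_skip_modules=["lm_head", "model.visual"],
)
```

Pass `quantization_config=bnb_config` to `from_pretrained` when quantizing at load time.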

Environment: transformers 5.3.0.dev0, bitsandbytes

Hi, yes, I think so. Thanks for the report.
I'm trying to train a LoRA adapter with limited resources (8 GB VRAM), which is why I quantized the model.
Because of the VRAM limit, I used a custom script that quantizes layer by layer.
I only checked text generation, not the vision encoder. I'll fix it.

I found and fixed the cause, and a new config version is uploaded.

See the commit:
Fixed quantization_config.llm_int8_skip_modules so that the visual tower layers are no longer re-quantized on load
https://huggingface.co/techwithsergiu/Qwen3.5-4B-bnb-4bit/commit/84ac8f70ed4064fb9e9358e174b5846b86bf8b93
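If you already have a locally exported checkpoint with the old config, a minimal sketch of applying the same fix by patching config.json directly (the function name `patch_skip_modules` is hypothetical; the skip list mirrors the commit above):

```python
# Sketch: add skip modules to an exported model's quantization_config
# so the vision tower is left at full precision on subsequent loads.
import json
from pathlib import Path


def patch_skip_modules(config_path, skip_modules):
    """Merge skip_modules into quantization_config.llm_int8_skip_modules."""
    cfg = json.loads(Path(config_path).read_text())
    qcfg = cfg.setdefault("quantization_config", {})
    existing = set(qcfg.get("llm_int8_skip_modules") or [])
    qcfg["llm_int8_skip_modules"] = sorted(existing | set(skip_modules))
    Path(config_path).write_text(json.dumps(cfg, indent=2))
    return qcfg["llm_int8_skip_modules"]
```

Usage: `patch_skip_modules("Qwen3.5-4B-bnb-4bit/config.json", ["lm_head", "model.visual"])`, then reload the model.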

I also ran an end-to-end check:

════════════════════════════════════════════════════════════
./Qwen3.5-4B-bnb-4bit [Qwen3.5 · local]
════════════════════════════════════════════════════════════
📦 Model weight size on disk: 3.52 GB

── 1. Config ────────────────────────────────────────────
✅ config.json OK | skip_modules: ['lm_head', 'model.visual']

── 2. Load ──────────────────────────────────────────────
The fast path is not available because one of the required library is not installed. Falling back to torch implementation. To install follow https://github.com/fla-org/flash-linear-attention#installation and https://github.com/Dao-AILab/causal-conv1d
Loading weights: 100%|████████████████████████████████████████| 723/723 [00:00<00:00, 919.05it/s, Materializing param=model.visual.pos_embed.weight]
✅ Tokenizer loaded | vocab size: 248,044
✅ Model loaded (Qwen3_5ForConditionalGeneration) | device_map: cuda:0
VRAM : 3.5 GB used / 7.7 GB total [after load]

── 3. Quantization ──────────────────────────────────────
✅ 248/347 linear layers quantized (71%)
✅ Visual tower: 98 layer(s) correctly at full precision
...
── Summary ─────────────────────────────────────────────────
total tokens : 1776
total time : 117.58s
avg speed : 15.1 tok/s

── 5c. Inference [IMAGE] ─────────────────────────
✅ [Image] response: 'red' (expected: 'red', case-insensitive)

✅ ALL CHECKS PASSED
