# Qwen3-VL-2B-Instruct — INT4 NF4 Quantized

Alibaba's Qwen3-VL-2B-Instruct quantized to 4-bit NF4 with double quantization for real-time robotic visual reasoning. The checkpoint is 2.7x smaller (4.1 GB down to 1.5 GB), enabling edge deployment alongside other perception models on a single GPU.
This model is part of the RobotFlowLabs model library, built for the ANIMA agentic robotics platform — a modular ROS2-native AI system that brings foundation model intelligence to real robots operating in the real world.
## Why This Model Exists
Robots need to see and reason simultaneously — understanding scenes, following visual instructions, and generating structured plans from what they observe. Qwen3-VL is Alibaba's latest generation vision-language model, surpassing Qwen2.5-VL with improved visual grounding, native video understanding, and stronger instruction following. At 1.5 GB quantized, the 2B variant fits on edge GPUs alongside segmentation, depth, and feature models — making it the ideal VLM for resource-constrained robotic systems.
## Model Details
| Property | Value |
|---|---|
| Architecture | Qwen3-VL (vision encoder + language decoder) |
| Total Parameters | 2B |
| Text Hidden Dimension | 2048 |
| Text Layers | 28 |
| Text Attention Heads | 16 (8 KV heads, GQA) |
| Text MLP Dimension | 6144 (SiLU activation) |
| Vision Encoder | 24-layer ViT (1024d, 16 heads, patch 16) |
| Vision Features | DeepStack at layers [5, 11, 17] |
| Spatial Merge | 2×2 (4 patches → 1 token) |
| Temporal Patch | 2 frames per token |
| Context Length | 262,144 tokens |
| Vocabulary | 151,936 tokens |
| RoPE | M-RoPE (interleaved, θ = 5,000,000) |
| Quantization | NF4 double quantization (bitsandbytes) |
| Original Model | Qwen/Qwen3-VL-2B-Instruct |
| License | Apache-2.0 |
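The vision parameters above determine how many tokens an image costs: 16×16-pixel patches are merged 2×2, so each visual token covers a 32×32-pixel area. A back-of-envelope sketch (the helper name is illustrative, and the real count also depends on Qwen's internal image resizing):

```python
def visual_token_count(height: int, width: int,
                       patch: int = 16, merge: int = 2) -> int:
    """Tokens one image contributes after the 2x2 spatial merge.

    Each patch covers patch x patch pixels; the merge folds a
    merge x merge grid of patches into a single token.
    """
    patches_h = height // patch
    patches_w = width // patch
    return (patches_h // merge) * (patches_w // merge)

# A 1024x1024 frame: 64x64 patches -> 32x32 = 1024 visual tokens
print(visual_token_count(1024, 1024))  # -> 1024
```

For video, the temporal patch size of 2 halves the frame dimension again: two consecutive frames share one token along the time axis.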
## Compression Results
Quantized on an NVIDIA L4 24GB GPU using bitsandbytes NF4 with double quantization.
| Metric | Original | INT4 Quantized | Change |
|---|---|---|---|
| Total Size | 4,057 MB | 1,494 MB | 2.7x smaller |
| Quantization | BF16 | NF4 + double quant | 4-bit weights |
| Compute Dtype | BF16 | BF16 | Preserved at inference |
| Format | SafeTensors | SafeTensors | Direct HF loading |
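The settings in the table map directly onto bitsandbytes' standard 4-bit options in transformers. A minimal sketch of how such a checkpoint could be produced (the output path is illustrative, not the pipeline actually used):

```python
import torch
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig

# NF4 with double quantization and BF16 compute, matching this card
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-2B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model.save_pretrained("./qwen3-vl-2b-instruct-int4")  # illustrative path
```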
## Quick Start
```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

# Load the quantized model; the bitsandbytes config is read from the checkpoint
model = AutoModelForImageTextToText.from_pretrained(
    "robotflowlabs/qwen3-vl-2b-instruct-int4",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("robotflowlabs/qwen3-vl-2b-instruct-int4")

image = Image.open("scene.jpg")
messages = [
    {"role": "system", "content": "You are a robotic vision assistant."},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Describe the objects on the table and their positions."},
    ]},
]

# Build the chat prompt, preprocess, and generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=[image], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(outputs[0], skip_special_tokens=True))
```
### With FORGE (ANIMA Integration)
```python
from forge.vlm import VLMRegistry

# Load through ANIMA's FORGE registry and query the scene
vlm = VLMRegistry.load("qwen3-vl-2b-instruct-int4")
description = vlm.describe(image, "What objects can you see and where are they?")
```
## Use Cases in ANIMA
Qwen3-VL-2B serves as the lightweight visual reasoning engine in ANIMA:
- Scene Understanding — Describe workspace contents, object positions, and spatial relationships
- Visual Grounding — Locate objects by natural language ("find the red cup near the edge")
- Instruction Grounding — Map visual instructions to actionable robot commands
- Visual QA — Answer operator questions about what the robot sees
- Anomaly Detection — Identify unexpected objects or scene changes
- Video Understanding — Temporal reasoning over camera feeds (native video support)
## Qwen3-VL Family on RobotFlowLabs
| Model | Params | Quantized Size | Best For |
|---|---|---|---|
| qwen3-vl-2b-instruct-int4 | 2B | 1.5 GB | Edge deployment, real-time |
| qwen3-vl-4b-instruct-int4 | 4B | 2.7 GB | Higher accuracy visual reasoning |
## About ANIMA
ANIMA is a modular, ROS2-native agentic robotics platform developed by RobotFlowLabs. It combines 58 specialized AI modules into a unified system for real-world robotic autonomy.
### Other Collections
- ANIMA Vision — SAM2, DINOv2, CLIP, SigLIP, Depth Anything
- ANIMA Language — Qwen2.5, SmolLM2
- ANIMA VLM — Qwen3-VL, Qwen2.5-VL
- ANIMA VLA — SmolVLA, RDT2-FM, FORGE students
## Intended Use

### Designed For
- On-device visual question answering and scene description
- Robotic instruction grounding from images and video
- Structured output generation (JSON scene graphs, object lists)
- Multi-turn visual dialogue with human operators
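Structured output is obtained by prompting the model to answer in JSON and parsing the reply. A minimal, defensive parser sketch (the `extract_json` helper and the fenced-reply format are assumptions, not part of the model's API):

```python
import json
import re

def extract_json(reply: str):
    """Pull the first JSON object out of a model reply, tolerating
    markdown code fences around it. Returns None if nothing parses."""
    # Strip ```json ... ``` fences if the model wrapped its answer
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", reply, re.DOTALL)
    candidate = fenced.group(1) if fenced else reply
    # Fall back to the outermost braces in the raw text
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end == -1:
        return None
    try:
        return json.loads(candidate[start:end + 1])
    except json.JSONDecodeError:
        return None

# Example: a typical VLM reply wrapping a scene graph in a code fence
reply = ('Here is the scene:\n```json\n'
         '{"objects": [{"name": "cup", "position": "table edge"}]}\n```')
scene = extract_json(reply)
```

Validating the parsed object against an expected schema before acting on it is advisable on a robot, since quantized models occasionally emit malformed JSON.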
### Limitations
- INT4 quantization may slightly reduce visual grounding precision
- 262K context window is generous but may not cover very long video sequences
- Requires GPU (bitsandbytes NF4 does not run on CPU)
- Inherits biases from Qwen3-VL training data
### Out of Scope
- Safety-critical autonomous decision making without human oversight
- Medical image analysis
- Surveillance applications
## Technical Details

### Compression Pipeline
```
Original Qwen3-VL-2B-Instruct (BF16, 4.1 GB)
│
└─→ bitsandbytes NF4 double quantization
    ├─→ bnb_4bit_quant_type: nf4
    ├─→ bnb_4bit_use_double_quant: true
    ├─→ bnb_4bit_compute_dtype: bfloat16
    └─→ model.safetensors (1.5 GB)
```
- Quantization: NF4 (Normal Float 4-bit) with double quantization via bitsandbytes
- Compute: BF16 at inference — weights dequantized on-the-fly
- Hardware: NVIDIA L4 24GB, CUDA 13.0, PyTorch 2.10, Python 3.14
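The "double quantization" cost can be made concrete: each weight takes 4 bits, every block of weights shares an absmax scale, and double quantization compresses those scales a second time. Assuming bitsandbytes' defaults (block size 64, scales quantized to 8 bits in groups of 256 with one float32 constant per group — values from the QLoRA scheme, not stated in this card):

```python
def nf4_bits_per_param(block: int = 64,
                       dq_block: int = 256,
                       dq_bits: int = 8,
                       const_bits: int = 32) -> float:
    """Effective bits per weight for NF4 with double quantization.

    4 bits per weight, plus one `dq_bits` scale per `block` weights,
    plus one float32 constant per `dq_block` scales.
    """
    return 4 + dq_bits / block + const_bits / (block * dq_block)

bpp = nf4_bits_per_param()              # ~4.127 bits per weight
approx_gb = 2e9 * bpp / 8 / 1e9        # ~1.0 GB for 2B quantized weights
# The shipped 1.5 GB checkpoint is larger because some layers
# (e.g. embeddings) are typically left unquantized in BF16.
```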
## Attribution

- Original Model: Qwen/Qwen3-VL-2B-Instruct by Alibaba Cloud (License: Apache-2.0)
- Compressed by: RobotFlowLabs using FORGE
## Citation

```bibtex
@article{qwen3vl,
  title={Qwen3-VL Technical Report},
  author={Qwen Team},
  year={2025}
}
```
Built with FORGE by RobotFlowLabs
Optimizing foundation models for real robots.