Qwen3-VL-2B-Instruct — INT4 NF4 Quantized

Alibaba's latest Qwen3-VL-2B-Instruct quantized to 4-bit NF4 with double quantization for real-time robotic visual reasoning. 2.7x smaller — from 4.1 GB to 1.5 GB — enabling edge deployment alongside other perception models on a single GPU.

This model is part of the RobotFlowLabs model library, built for the ANIMA agentic robotics platform — a modular ROS2-native AI system that brings foundation model intelligence to real robots operating in the real world.

Why This Model Exists

Robots need to see and reason simultaneously — understanding scenes, following visual instructions, and generating structured plans from what they observe. Qwen3-VL is Alibaba's latest generation vision-language model, surpassing Qwen2.5-VL with improved visual grounding, native video understanding, and stronger instruction following. At 1.5 GB quantized, the 2B variant fits on edge GPUs alongside segmentation, depth, and feature models — making it the ideal VLM for resource-constrained robotic systems.

Model Details

| Property | Value |
|---|---|
| Architecture | Qwen3-VL (vision encoder + language decoder) |
| Total Parameters | 2B |
| Text Hidden Dimension | 2048 |
| Text Layers | 28 |
| Text Attention Heads | 16 (8 KV heads, GQA) |
| Text MLP Dimension | 6144 (SiLU activation) |
| Vision Encoder | 24-layer ViT (1024d, 16 heads, patch 16) |
| Vision Features | DeepStack at layers [5, 11, 17] |
| Spatial Merge | 2×2 (4 patches → 1 token) |
| Temporal Patch | 2 frames per token |
| Context Length | 262,144 tokens |
| Vocabulary | 151,936 tokens |
| RoPE | M-RoPE (interleaved, θ = 5,000,000) |
| Quantization | NF4 double quantization (bitsandbytes) |
| Original Model | Qwen/Qwen3-VL-2B-Instruct |
| License | Apache-2.0 |
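The patch size and spatial merge settings above determine how many tokens an image consumes. A back-of-envelope sketch (actual counts depend on the processor's dynamic-resolution preprocessing, so treat this as an estimate, not the exact tokenizer behavior):

```python
def visual_tokens(height: int, width: int, patch: int = 16, merge: int = 2) -> int:
    """Rough visual-token count for one image: ViT patches, then a 2x2 spatial merge."""
    patches = (height // patch) * (width // patch)
    return patches // (merge * merge)

# A 448x448 input: 28x28 = 784 patches -> 196 tokens after the 2x2 merge.
print(visual_tokens(448, 448))  # 196
```

At roughly 200 tokens per VGA-class image, even long multi-image prompts sit far below the 262K context limit.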

Compression Results

Quantized on an NVIDIA L4 24GB GPU using bitsandbytes NF4 with double quantization.

| Metric | Original | INT4 Quantized | Change |
|---|---|---|---|
| Total Size | 4,057 MB | 1,494 MB | 2.7x smaller |
| Quantization | BF16 | NF4 + double quant | 4-bit weights |
| Compute Dtype | BF16 | BF16 | Preserved at inference |
| Format | SafeTensors | SafeTensors | Direct HF loading |
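Why 2.7x rather than the ideal 4x for BF16 to 4-bit? bitsandbytes quantizes only the linear layers; embeddings, norms, and the block-wise quantization constants stay in higher precision (the checkpoint's mixed F32/BF16/U8 tensor types reflect this). The arithmetic, as a sanity check rather than an exact accounting:

```python
original_mb = 4057
quantized_mb = 1494

# A pure BF16 -> 4-bit conversion would shrink the weights 4x:
ideal_mb = original_mb / 4              # ~1014 MB
# The remainder is tensors kept in higher precision plus quant constants:
overhead_mb = quantized_mb - ideal_mb   # ~480 MB

print(round(original_mb / quantized_mb, 1))  # 2.7
```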

Quick Start

```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

# Weights are stored pre-quantized; bitsandbytes must be installed.
model = AutoModelForImageTextToText.from_pretrained(
    "robotflowlabs/qwen3-vl-2b-instruct-int4",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("robotflowlabs/qwen3-vl-2b-instruct-int4")

image = Image.open("scene.jpg")
messages = [
    {"role": "system", "content": "You are a robotic vision assistant."},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Describe the objects on the table and their positions."},
    ]},
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Use model.device rather than hardcoding "cuda", since device_map places the model.
inputs = processor(text=text, images=[image], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(outputs[0], skip_special_tokens=True))
```

With FORGE (ANIMA Integration)

```python
from forge.vlm import VLMRegistry

vlm = VLMRegistry.load("qwen3-vl-2b-instruct-int4")
description = vlm.describe(image, "What objects can you see and where are they?")
```

Use Cases in ANIMA

Qwen3-VL-2B serves as the lightweight visual reasoning engine in ANIMA:

  • Scene Understanding — Describe workspace contents, object positions, and spatial relationships
  • Visual Grounding — Locate objects by natural language ("find the red cup near the edge")
  • Instruction Grounding — Map visual instructions to actionable robot commands
  • Visual QA — Answer operator questions about what the robot sees
  • Anomaly Detection — Identify unexpected objects or scene changes
  • Video Understanding — Temporal reasoning over camera feeds (native video support)

Qwen3-VL Family on RobotFlowLabs

| Model | Params | Quantized Size | Best For |
|---|---|---|---|
| qwen3-vl-2b-instruct-int4 | 2B | 1.5 GB | Edge deployment, real-time |
| qwen3-vl-4b-instruct-int4 | 4B | 2.7 GB | Higher accuracy visual reasoning |

About ANIMA

ANIMA is a modular, ROS2-native agentic robotics platform developed by RobotFlowLabs. It combines 58 specialized AI modules into a unified system for real-world robotic autonomy.

Intended Use

Designed For

  • On-device visual question answering and scene description
  • Robotic instruction grounding from images and video
  • Structured output generation (JSON scene graphs, object lists)
  • Multi-turn visual dialogue with human operators

Limitations

  • INT4 quantization may slightly reduce visual grounding precision
  • 262K context window is generous but may not cover very long video sequences
  • Requires GPU (bitsandbytes NF4 does not run on CPU)
  • Inherits biases from Qwen3-VL training data

Out of Scope

  • Safety-critical autonomous decision making without human oversight
  • Medical image analysis
  • Surveillance applications

Technical Details

Compression Pipeline

```
Original Qwen3-VL-2B-Instruct (BF16, 4.1 GB)
    │
    └─→ bitsandbytes NF4 double quantization
        ├─→ bnb_4bit_quant_type: nf4
        ├─→ bnb_4bit_use_double_quant: true
        ├─→ bnb_4bit_compute_dtype: bfloat16
        └─→ model.safetensors (1.5 GB)
```

  • Quantization: NF4 (Normal Float 4-bit) with double quantization via bitsandbytes
  • Compute: BF16 at inference — weights dequantized on-the-fly
  • Hardware: NVIDIA L4 24GB, CUDA 13.0, PyTorch 2.10, Python 3.14
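The settings in the diagram map onto the standard `BitsAndBytesConfig` knobs in transformers. A sketch of the config that would reproduce this quantization when loading the BF16 original (a config fragment, not the exact pipeline script used here):

```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # 4-bit weight storage
    bnb_4bit_quant_type="nf4",               # Normal Float 4 data type
    bnb_4bit_use_double_quant=True,          # also quantize the quant constants
    bnb_4bit_compute_dtype=torch.bfloat16,   # dequantize to BF16 at inference
)
# Pass as quantization_config= to from_pretrained on the original checkpoint.
```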

Citation

```bibtex
@article{qwen3vl,
  title={Qwen3-VL Technical Report},
  author={Qwen Team},
  year={2025}
}
```

Built with FORGE by RobotFlowLabs
Optimizing foundation models for real robots.
