Qwen3-VL-2B-Instruct — INT4 NF4 Quantized

Alibaba's latest Qwen3-VL-2B-Instruct quantized to 4-bit NF4 with double quantization for real-time robotic visual reasoning. 2.7x smaller — from 4.1 GB to 1.5 GB — enabling edge deployment alongside other perception models on a single GPU.

This model is part of the RobotFlowLabs model library, built for the ANIMA agentic robotics platform — a modular ROS2-native AI system that brings foundation model intelligence to real robots operating in the real world.

Why This Model Exists

Robots need to see and reason simultaneously — understanding scenes, following visual instructions, and generating structured plans from what they observe. Qwen3-VL is Alibaba's latest generation vision-language model, surpassing Qwen2.5-VL with improved visual grounding, native video understanding, and stronger instruction following. At 1.5 GB quantized, the 2B variant fits on edge GPUs alongside segmentation, depth, and feature models — making it the ideal VLM for resource-constrained robotic systems.

Model Details

| Property | Value |
|---|---|
| Architecture | Qwen3-VL (vision encoder + language decoder) |
| Total Parameters | 2B |
| Text Hidden Dimension | 2048 |
| Text Layers | 28 |
| Text Attention Heads | 16 (8 KV heads, GQA) |
| Text MLP Dimension | 6144 (SiLU activation) |
| Vision Encoder | 24-layer ViT (1024d, 16 heads, patch 16) |
| Vision Features | DeepStack at layers [5, 11, 17] |
| Spatial Merge | 2×2 (4 patches → 1 token) |
| Temporal Patch | 2 frames per token |
| Context Length | 262,144 tokens |
| Vocabulary | 151,936 tokens |
| RoPE | M-RoPE (interleaved, θ = 5,000,000) |
| Quantization | NF4 double quantization (bitsandbytes) |
| Original Model | Qwen/Qwen3-VL-2B-Instruct |
| License | Apache-2.0 |
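The patch size and spatial merge settings above determine how many tokens an image consumes. A back-of-envelope sketch (actual counts depend on the processor's dynamic-resolution preprocessing, so treat this as an estimate, not the exact tokenizer behavior):

```python
def visual_tokens(height: int, width: int, patch: int = 16, merge: int = 2) -> int:
    """Rough visual-token count for one image: ViT patches, then a 2x2 spatial merge."""
    patches = (height // patch) * (width // patch)
    return patches // (merge * merge)

# A 448x448 input: 28x28 = 784 patches -> 196 tokens after the 2x2 merge.
print(visual_tokens(448, 448))  # 196
```

At roughly 200 tokens per VGA-class image, even long multi-image prompts sit far below the 262K context limit.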

Compression Results

Quantized on an NVIDIA L4 24GB GPU using bitsandbytes NF4 with double quantization.

| Metric | Original | INT4 Quantized | Change |
|---|---|---|---|
| Total Size | 4,057 MB | 1,494 MB | 2.7x smaller |
| Quantization | BF16 | NF4 + double quant | 4-bit weights |
| Compute Dtype | BF16 | BF16 | Preserved at inference |
| Format | SafeTensors | SafeTensors | Direct HF loading |
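Why 2.7x rather than the ideal 4x for BF16 to 4-bit? bitsandbytes quantizes only the linear layers; embeddings, norms, and the block-wise quantization constants stay in higher precision (the checkpoint's mixed F32/BF16/U8 tensor types reflect this). The arithmetic, as a sanity check rather than an exact accounting:

```python
original_mb = 4057
quantized_mb = 1494

# A pure BF16 -> 4-bit conversion would shrink the weights 4x:
ideal_mb = original_mb / 4              # ~1014 MB
# The remainder is tensors kept in higher precision plus quant constants:
overhead_mb = quantized_mb - ideal_mb   # ~480 MB

print(round(original_mb / quantized_mb, 1))  # 2.7
```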

Quick Start

```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

# Weights are stored pre-quantized; bitsandbytes must be installed.
model = AutoModelForImageTextToText.from_pretrained(
    "robotflowlabs/qwen3-vl-2b-instruct-int4",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("robotflowlabs/qwen3-vl-2b-instruct-int4")

image = Image.open("scene.jpg")
messages = [
    {"role": "system", "content": "You are a robotic vision assistant."},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Describe the objects on the table and their positions."},
    ]},
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Use model.device rather than hardcoding "cuda", since device_map places the model.
inputs = processor(text=text, images=[image], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(outputs[0], skip_special_tokens=True))
```

With FORGE (ANIMA Integration)

```python
from forge.vlm import VLMRegistry

vlm = VLMRegistry.load("qwen3-vl-2b-instruct-int4")
description = vlm.describe(image, "What objects can you see and where are they?")
```

Use Cases in ANIMA

Qwen3-VL-2B serves as the lightweight visual reasoning engine in ANIMA:

  • Scene Understanding — Describe workspace contents, object positions, and spatial relationships
  • Visual Grounding — Locate objects by natural language ("find the red cup near the edge")
  • Instruction Grounding — Map visual instructions to actionable robot commands
  • Visual QA — Answer operator questions about what the robot sees
  • Anomaly Detection — Identify unexpected objects or scene changes
  • Video Understanding — Temporal reasoning over camera feeds (native video support)

Qwen3-VL Family on RobotFlowLabs

| Model | Params | Quantized Size | Best For |
|---|---|---|---|
| qwen3-vl-2b-instruct-int4 | 2B | 1.5 GB | Edge deployment, real-time |
| qwen3-vl-4b-instruct-int4 | 4B | 2.7 GB | Higher accuracy visual reasoning |

About ANIMA

ANIMA is a modular, ROS2-native agentic robotics platform developed by RobotFlowLabs. It combines 58 specialized AI modules into a unified system for real-world robotic autonomy.

Intended Use

Designed For

  • On-device visual question answering and scene description
  • Robotic instruction grounding from images and video
  • Structured output generation (JSON scene graphs, object lists)
  • Multi-turn visual dialogue with human operators

Limitations

  • INT4 quantization may slightly reduce visual grounding precision
  • 262K context window is generous but may not cover very long video sequences
  • Requires GPU (bitsandbytes NF4 does not run on CPU)
  • Inherits biases from Qwen3-VL training data

Out of Scope

  • Safety-critical autonomous decision making without human oversight
  • Medical image analysis
  • Surveillance applications

Technical Details

Compression Pipeline

```
Original Qwen3-VL-2B-Instruct (BF16, 4.1 GB)
    │
    └─→ bitsandbytes NF4 double quantization
        ├─→ bnb_4bit_quant_type: nf4
        ├─→ bnb_4bit_use_double_quant: true
        ├─→ bnb_4bit_compute_dtype: bfloat16
        └─→ model.safetensors (1.5 GB)
```

  • Quantization: NF4 (Normal Float 4-bit) with double quantization via bitsandbytes
  • Compute: BF16 at inference — weights dequantized on-the-fly
  • Hardware: NVIDIA L4 24GB, CUDA 13.0, PyTorch 2.10, Python 3.14
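The settings in the diagram map onto the standard `BitsAndBytesConfig` knobs in transformers. A sketch of the config that would reproduce this quantization when loading the BF16 original (a config fragment, not the exact pipeline script used here):

```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # 4-bit weight storage
    bnb_4bit_quant_type="nf4",               # Normal Float 4 data type
    bnb_4bit_use_double_quant=True,          # also quantize the quant constants
    bnb_4bit_compute_dtype=torch.bfloat16,   # dequantize to BF16 at inference
)
# Pass as quantization_config= to from_pretrained on the original checkpoint.
```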

Citation

```bibtex
@article{qwen3vl,
  title={Qwen3-VL Technical Report},
  author={Qwen Team},
  year={2025}
}
```

Built with FORGE by RobotFlowLabs
Optimizing foundation models for real robots.
