---
base_model:
  - Qwen/Qwen3-VL-4B-Instruct
language:
  - en
license: apache-2.0
pipeline_tag: image-to-image
library_name: transformers
tags:
  - autonomous-driving
  - vision-language-action
  - chain-of-thought
  - trajectory-prediction
  - VLA
---

# OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

[πŸ“„ Paper (arXiv)](https://arxiv.org/abs/2604.18486) | πŸ’» GitHub | 🌐 Project Page

OneVL is a Vision-Language-Action (VLA) framework for autonomous driving that achieves state-of-the-art trajectory prediction accuracy while matching the inference latency of answer-only autoregressive models.

## Overview

OneVL addresses the limitations of prior latent Chain-of-Thought (CoT) methods by introducing dual-modal auxiliary decoders. These decoders force compact latent tokens to encode both human-readable reasoning and future scene dynamics. During inference, these decoders are discarded, and the latent tokens are prefilled into the context in a single parallel pass, achieving high performance at answer-only speeds.
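
A minimal sketch of that training-time wiring, assuming standard PyTorch modules; the module names, layer counts, and dimensions below are illustrative guesses, not the released OneVL implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes only (NOT the released config).
HIDDEN, TEXT_VOCAB, VIS_VOCAB = 2560, 152_064, 32_768
NUM_VIS_LATENT, NUM_LANG_LATENT = 4, 2

class LanguageAuxDecoder(nn.Module):
    """Reconstructs the explicit CoT text from the 2 language latent states
    via a small cross-attending decoder. Discarded at inference time."""
    def __init__(self, hidden=HIDDEN, vocab=TEXT_VOCAB):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        layer = nn.TransformerDecoderLayer(hidden, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(hidden, vocab)

    def forward(self, cot_ids, lang_latents):  # (B, T), (B, 2, H)
        x = self.embed(cot_ids)
        mask = nn.Transformer.generate_square_subsequent_mask(
            x.size(1), device=x.device)
        return self.lm_head(self.decoder(x, lang_latents, tgt_mask=mask))

class VisualAuxDecoder(nn.Module):
    """World model: learned queries cross-attend to the 4 visual latent
    states to predict future-frame visual tokens (t+0.5 s and t+1.0 s)."""
    def __init__(self, hidden=HIDDEN, vocab=VIS_VOCAB, n_future=512):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_future, hidden) * 0.02)
        layer = nn.TransformerDecoderLayer(hidden, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(hidden, vocab)

    def forward(self, vis_latents):  # (B, 4, H)
        q = self.queries.expand(vis_latents.size(0), -1, -1)
        return self.head(self.decoder(q, vis_latents))

def auxiliary_loss(latent_states, cot_ids, future_vis_ids, lang_dec, vis_dec):
    """latent_states: backbone hidden states at the 4+2 latent positions."""
    vis_latents = latent_states[:, :NUM_VIS_LATENT]
    lang_latents = latent_states[:, NUM_VIS_LATENT:]
    # Teacher-forced CoT reconstruction (shift by one for next-token loss).
    lang_logits = lang_dec(cot_ids[:, :-1], lang_latents)
    loss_lang = F.cross_entropy(lang_logits.flatten(0, 1),
                                cot_ids[:, 1:].flatten())
    # Future-frame visual token prediction.
    vis_logits = vis_dec(vis_latents)
    loss_vis = F.cross_entropy(vis_logits.flatten(0, 1),
                               future_vis_ids.flatten())
    return loss_lang + loss_vis  # added to the usual answer LM loss
```

Because both decoders attach only to the latent positions, dropping them at inference leaves the backbone unchanged.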

## Key Architecture Components

- **Latent Token Interface**: 4 visual and 2 language latent tokens inserted before the response.
- **Visual Auxiliary Decoder**: acts as a world model, predicting future-frame visual tokens (at t+0.5 s and t+1.0 s).
- **Language Auxiliary Decoder**: reconstructs explicit CoT reasoning text from the language latent hidden states.
- **Prefill Inference**: enables a 1.5Γ— to 2.3Γ— speedup over explicit autoregressive CoT (see the sketch below).
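
The prefill step is where that speedup comes from: the latent positions are filled in one parallel forward pass instead of being decoded token by token. A rough control-flow sketch with a Hugging Face causal LM, assuming batch size 1; the latent placeholder ids and constants are hypothetical, not OneVL's actual special tokens:

```python
import torch

VIS_LATENT_ID, LANG_LATENT_ID = 151650, 151651  # hypothetical special-token ids
MAX_ANSWER_TOKENS = 128

@torch.no_grad()
def generate_with_latent_prefill(model, tokenizer, prompt_ids):
    """One parallel pass fills the 4 visual + 2 language latent positions;
    only the short answer (the trajectory) is decoded autoregressively."""
    latents = torch.tensor([[VIS_LATENT_ID] * 4 + [LANG_LATENT_ID] * 2],
                           device=prompt_ids.device)
    input_ids = torch.cat([prompt_ids, latents], dim=1)

    # Single prefill pass: latent hidden states land in the KV cache,
    # standing in for the explicit CoT that is never generated.
    out = model(input_ids=input_ids, use_cache=True)
    past, next_id = out.past_key_values, out.logits[:, -1:].argmax(-1)

    answer = [next_id]
    for _ in range(MAX_ANSWER_TOKENS - 1):  # plain greedy decoding
        out = model(input_ids=next_id, past_key_values=past, use_cache=True)
        past, next_id = out.past_key_values, out.logits[:, -1:].argmax(-1)
        answer.append(next_id)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(torch.cat(answer, dim=1)[0])
```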

## Usage

### Requirements

- Python 3.10+, CUDA GPU (β‰₯16 GB VRAM recommended)
- `transformers >= 4.57.0` (required for `Qwen3VLForConditionalGeneration`)

```bash
# Environment setup
uv venv venv/onevl --python 3.12
source venv/onevl/bin/activate
pip install -r requirements.txt
```
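
As a quick sanity check that your environment meets the version requirement, the base class named above should import and load; a minimal snippet (the checkpoint path is a placeholder, as in the command below):

```python
import transformers
from packaging.version import Version
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

# Qwen3VLForConditionalGeneration only exists in transformers >= 4.57.0.
assert Version(transformers.__version__) >= Version("4.57.0")

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "/path/to/OneVL-checkpoint",  # placeholder checkpoint path
    dtype="auto",
    device_map="auto",            # requires `accelerate`
)
processor = AutoProcessor.from_pretrained("/path/to/OneVL-checkpoint")
```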

### Inference (Trajectory Prediction Only)

```bash
python infer_onevl.py \
    --model_path /path/to/OneVL-checkpoint \
    --test_set_path test_data/navsim_test.json \
    --image_base_path "" \
    --output_path output/navsim/results.json \
    --device cuda:0 \
    --num_latent 2 --num_latent_vis 4 \
    --max_new_tokens 1024 --answer_prefix "[" --prefix_k 0
```

For full inference options, including language and visual explanations, please refer to the GitHub repository. Note that `--num_latent 2 --num_latent_vis 4` matches the 2 language and 4 visual latent tokens described above.

## Results

OneVL is the first latent CoT method to surpass explicit autoregressive CoT across all major autonomous driving benchmarks.

| Benchmark | Metric | AR CoT+Answer | OneVL |
|-----------|-------------|-------|-------|
| NAVSIM    | PDM-score ↑ | 88.29 | 88.84 |
| ROADWork  | ADE (px) ↓  | 13.18 | 12.49 |
| Impromptu | ADE (m) ↓   | 1.42  | 1.34  |
| APR1      | ADE (m) ↓   | 2.99  | 2.62  |
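
For reference, ADE (average displacement error) is the mean Euclidean distance between predicted and ground-truth waypoints over the prediction horizon; a minimal NumPy implementation:

```python
import numpy as np

def ade(pred, gt):
    """Average displacement error: mean L2 distance between predicted and
    ground-truth waypoints of shape (T, 2), in metres (pixels for ROADWork)."""
    return float(np.linalg.norm(np.asarray(pred) - np.asarray(gt), axis=-1).mean())

# Toy check: a constant 0.1 m lateral offset gives ADE = 0.1 m.
gt = np.stack([np.arange(4.0), np.zeros(4)], axis=1)  # straight-line trajectory
print(ade(gt + np.array([0.0, 0.1]), gt))             # 0.1
```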

## Citation

```bibtex
@article{lu2026onevl,
  title={OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation},
  author={Lu, Jinghui and Guan, Jiayi and Huang, Zhijian and Li, Jinlong and Li, Guang and Kong, Lingdong and Li, Yingyan and Wang, Han and Xu, Shaoqing and Luo, Yuechen and others},
  journal={arXiv preprint arXiv:2604.18486},
  year={2026},
  url={https://arxiv.org/abs/2604.18486}
}
```

## License

This project is released under the Apache 2.0 License. Model weights are built on Qwen3-VL-4B-Instruct, and the visual tokenizer is from Emu3.5-VisionTokenizer.