nielsr HF Staff committed on
Commit dacd423 · verified · 1 Parent(s): ab0c645

Update model card metadata and add library info for OneVL

Hi! I'm Niels from the Hugging Face team. I've updated the model card for OneVL to include relevant metadata and structured information.
- Added `library_name: transformers` as the model is compatible with the Transformers library (specifically using the `Qwen3VLForConditionalGeneration` architecture).
- Updated the `pipeline_tag` to `image-to-image` as requested.
- Added structured documentation, including an overview, architecture details, and sample usage snippets for inference taken directly from the official repository.

Files changed (1)
  1. README.md +14 -111
README.md CHANGED
@@ -1,121 +1,62 @@
  ---
- license: apache-2.0
  language:
  - en
  tags:
  - autonomous-driving
  - vision-language-action
  - chain-of-thought
  - trajectory-prediction
  - VLA
- base_model:
- - Qwen/Qwen3-VL-4B-Instruct
- pipeline_tag: image-text-to-text
  ---

  # OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

  **[📄 Paper (arXiv)](https://arxiv.org/abs/2604.18486)** | **[💻 GitHub](https://github.com/xiaomi-research/onevl)** | **[🌐 Project Page](https://Xiaomi-Embodied-Intelligence.github.io/OneVL/)**

- *Xiaomi Embodied Intelligence Team*
-
- ---

  ## Overview

- **OneVL** is a Vision-Language-Action (VLA) framework for autonomous driving that achieves **state-of-the-art trajectory prediction accuracy** while matching the inference latency of answer-only autoregressive models.
-
  Prior latent Chain-of-Thought (CoT) methods compress reasoning into opaque hidden states; this is fast but consistently underperforms explicit CoT on driving tasks. OneVL identifies the root cause: purely linguistic latents encode abstract semantic labels rather than the spatiotemporal causal dynamics that govern real driving scenes. OneVL addresses this with **dual-modal auxiliary decoders** that force compact latent tokens to encode both human-readable reasoning *and* future scene dynamics simultaneously.

  At inference, both decoders are discarded and all latents are **prefilled** into the prompt context in a single parallel pass, matching answer-only AR prediction speed while recovering the interpretability of explicit CoT in both vision and language.

- OneVL is the **first latent CoT method to surpass explicit autoregressive CoT** across all four driving benchmarks.
-
- ---
-
  ## Architecture

- OneVL augments **Qwen3-VL-4B-Instruct** with three components:
-
- **Latent Token Interface**: 4 visual latent tokens + 2 language latent tokens are inserted in the assistant response before the answer, using existing vocabulary tokens (no new special tokens added).
-
- **Visual Auxiliary Decoder**: Predicts future-frame visual tokens at t+0.5s and t+1.0s from visual latent hidden states (using the Emu3.5 IBQ 131k codebook). Acts as a **world model** supervision signal that forces the latent space to encode genuine physical scene dynamics (agent trajectories, road geometry, and environmental change) rather than abstract descriptions.
-
- **Language Auxiliary Decoder**: Reconstructs explicit CoT reasoning text from language latent hidden states, conditioned on ViT visual features. Recovers 97% of explicit CoT text quality while running at answer-only speed.

- **Prefill Inference**: Both decoders are discarded at inference time. All latent tokens are processed in a single parallel prefill pass; only the trajectory answer is generated autoregressively. This achieves a **1.5× speedup over explicit CoT on NAVSIM** and **2.3× on ROADWork**.
-
- ### Three-Stage Training Pipeline
-
- Training proceeds in three stages to ensure stable joint optimization:
- - **Stage 0**: Main model warmup (trajectory prediction)
- - **Stage 1**: Auxiliary decoder warmup (language + visual decoders independently)
- - **Stage 2**: Joint end-to-end fine-tuning (all components together)
-
- Staged training is essential: ablation shows that skipping it collapses the PDM-score from 88.84 to 67.13.
-
- ---

  ## Results

- ### NAVSIM

  | Method | Model Size | PDM-score ↑ | Latency (s) ↓ | Interpretability |
  |---|:---:|:---:|:---:|:---:|
- | AR Answer | 4B | 87.47 | 4.49 | – |
  | AR CoT+Answer | 4B | 88.29 | 6.58 | Language |
- | COCONUT | 4B | 84.84 | 5.93 | – |
- | CODI | 4B | 83.92 | 8.62 | – |
- | SIM-CoT | 4B | 84.21 | 10.86 | Language |
  | **OneVL** | **4B** | **88.84** | **4.46** | **Vision + Language** |

  ### ROADWork
-
  | Method | ADE (px) ↓ | FDE (px) ↓ | Latency (s) ↓ |
  |---|:---:|:---:|:---:|
  | AR CoT+Answer | 13.18 | 29.98 | 10.74 |
  | **OneVL** | **12.49** | **28.80** | **4.71** |

- ### Impromptu
-
- | Method | ADE (m) ↓ | FDE (m) ↓ | Latency (s) ↓ |
- |---|:---:|:---:|:---:|
- | AR CoT+Answer | 1.42 | 3.96 | 6.84 |
- | **OneVL** | **1.34** | **3.70** | **4.02** |
-
- ### APR1 (Alpamayo-R1)
-
- | Method | ADE (m) ↓ | FDE (m) ↓ | Latency (s) ↓ |
- |---|:---:|:---:|:---:|
- | AR CoT+Answer | 2.99 | 8.54 | 3.51 |
- | **OneVL** | **2.62** | **7.53** | **3.26** |
-
- ### CoT Text Quality (NAVSIM)
-
- | Method | Meta Action Acc. ↑ | STS Score ↑ | LLM Judge ↑ | Latency (s) ↓ |
- |---|:---:|:---:|:---:|:---:|
- | AR CoT+Answer | 73.20 | 79.75 | 81.86 | 6.58 |
- | **OneVL** | 71.00 | 78.26 | 79.13 | **4.46** |
-
- OneVL's language auxiliary decoder recovers 97% of explicit CoT quality at answer-only inference speed.
-
- ---
-
  ## Usage

  ### Requirements
-
- - Python 3.10+, CUDA GPU (≥16 GB VRAM recommended)
  - `transformers >= 4.57.0` (required for `Qwen3VLForConditionalGeneration`)

- ```bash
- uv venv venv/onevl --python 3.12
- source venv/onevl/bin/activate
- pip install -r requirements.txt
- ```
-
  ### Inference (Trajectory Prediction Only)
-
  ```bash
  python infer_onevl.py \
  --model_path /path/to/OneVL-checkpoint \
@@ -128,7 +69,6 @@ python infer_onevl.py \
  ```

  ### Inference with Language + Visual Explanation
-
  ```bash
  python infer_onevl.py \
  --model_path /path/to/OneVL-checkpoint \
@@ -144,39 +84,6 @@ python infer_onevl.py \
  --c_thought_visual 4 --max_visual_tokens 2560
  ```

- ### Multi-GPU Inference
-
- ```bash
- export MODEL_PATH=/path/to/OneVL-checkpoint
- export TEST_SET_PATH=test_data/navsim_test.json
- export OUTPUT_PATH=output/navsim/navsim_results.json
- bash run_infer.sh
- ```
-
- Per-benchmark scripts are available in `scripts/`:
-
- ```bash
- bash scripts/infer_navsim.sh
- bash scripts/infer_ar1.sh
- bash scripts/infer_roadwork.sh
- bash scripts/infer_impromptu.sh
- ```
-
- For full documentation, evaluation scripts, and data format details, see the [GitHub repository](https://github.com/xiaomi-research/onevl).
-
- ---
-
- ## Open-Source Status
-
- | Component | Status |
- |---|:---:|
- | Technical Report | ✅ Released |
- | Model Weights | ✅ Released |
- | Inference Code | ✅ Released |
- | Training Code | 🔜 Coming Soon |
-
- ---
-
  ## Citation

  ```bibtex
@@ -189,10 +96,6 @@ For full documentation, evaluation scripts, and data format details, see the [GitHub repository](https://github.com/xiaomi-research/onevl).
  }
  ```

- ---
-
  ## License
-
  Released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
-
- Model weights are built on [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) and the visual tokenizer is from [Emu3.5-VisionTokenizer](https://huggingface.co/BAAI/Emu3.5-VisionTokenizer); please refer to their respective licenses as well.

  ---
+ base_model:
+ - Qwen/Qwen3-VL-4B-Instruct
  language:
  - en
+ license: apache-2.0
+ pipeline_tag: image-to-image
+ library_name: transformers
  tags:
  - autonomous-driving
  - vision-language-action
  - chain-of-thought
  - trajectory-prediction
  - VLA
  ---

  # OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

  **[📄 Paper (arXiv)](https://arxiv.org/abs/2604.18486)** | **[💻 GitHub](https://github.com/xiaomi-research/onevl)** | **[🌐 Project Page](https://Xiaomi-Embodied-Intelligence.github.io/OneVL/)**

+ OneVL is a Vision-Language-Action (VLA) framework for autonomous driving that achieves state-of-the-art trajectory prediction accuracy while matching the inference latency of answer-only autoregressive models.

  ## Overview

  Prior latent Chain-of-Thought (CoT) methods compress reasoning into opaque hidden states; this is fast but consistently underperforms explicit CoT on driving tasks. OneVL identifies the root cause: purely linguistic latents encode abstract semantic labels rather than the spatiotemporal causal dynamics that govern real driving scenes. OneVL addresses this with **dual-modal auxiliary decoders** that force compact latent tokens to encode both human-readable reasoning *and* future scene dynamics simultaneously.

  At inference, both decoders are discarded and all latents are **prefilled** into the prompt context in a single parallel pass, matching answer-only AR prediction speed while recovering the interpretability of explicit CoT in both vision and language.

  ## Architecture

+ OneVL augments **Qwen3-VL-4B-Instruct** with:

+ - **Latent Token Interface**: 4 visual latent tokens + 2 language latent tokens inserted in the assistant response before the answer.
+ - **Visual Auxiliary Decoder**: Predicts future-frame visual tokens at t+0.5s and t+1.0s from visual latent hidden states (using the Emu3.5 IBQ codebook), acting as a world model.
+ - **Language Auxiliary Decoder**: Reconstructs explicit CoT reasoning text from language latent hidden states.
+ - **Prefill Inference**: Both decoders are discarded at inference; latent tokens are processed in one parallel pass with only the trajectory generated autoregressively (see the sketch below).
 
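To make the prefill step concrete, here is a minimal, framework-agnostic sketch of the decoding pattern described in the last bullet: one parallel pass over the full prompt, followed by answer-only greedy decoding against the cached states. This is an illustration of the pattern, not OneVL's actual implementation; `prompt_ids`, `eos_token_id`, and the generic Hugging Face causal-LM interface are assumptions.

```python
# Illustrative sketch only (not OneVL's code). prompt_ids is assumed to already
# contain the image tokens, the question, and the 4 visual + 2 language latent
# token positions; they are all processed in a single prefill forward pass.
import torch


@torch.no_grad()
def prefill_then_generate(model, prompt_ids, eos_token_id, max_new_tokens=64):
    # Parallel prefill pass: the latent positions cost no extra decode steps.
    out = model(input_ids=prompt_ids, use_cache=True)
    past = out.past_key_values
    next_token = out.logits[:, -1:].argmax(dim=-1)
    answer = [next_token]
    # Only the trajectory answer is generated autoregressively, reusing the cache.
    for _ in range(max_new_tokens - 1):
        if (next_token == eos_token_id).all():
            break
        out = model(input_ids=next_token, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_token = out.logits[:, -1:].argmax(dim=-1)
        answer.append(next_token)
    return torch.cat(answer, dim=1)
```

The point is simply that the latent tokens add prefill compute but no additional autoregressive steps, which is where the reported latency advantage over explicit CoT comes from.
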
  ## Results

+ OneVL is the first latent CoT method to surpass explicit autoregressive CoT across major driving benchmarks.

+ ### NAVSIM
  | Method | Model Size | PDM-score ↑ | Latency (s) ↓ | Interpretability |
  |---|:---:|:---:|:---:|:---:|
  | AR CoT+Answer | 4B | 88.29 | 6.58 | Language |
  | **OneVL** | **4B** | **88.84** | **4.46** | **Vision + Language** |

  ### ROADWork
  | Method | ADE (px) ↓ | FDE (px) ↓ | Latency (s) ↓ |
  |---|:---:|:---:|:---:|
  | AR CoT+Answer | 13.18 | 29.98 | 10.74 |
  | **OneVL** | **12.49** | **28.80** | **4.71** |

  ## Usage

  ### Requirements
  - `transformers >= 4.57.0` (required for `Qwen3VLForConditionalGeneration`)

  ### Inference (Trajectory Prediction Only)
  ```bash
  python infer_onevl.py \
  --model_path /path/to/OneVL-checkpoint \
  ...
  ```

  ### Inference with Language + Visual Explanation
  ```bash
  python infer_onevl.py \
  --model_path /path/to/OneVL-checkpoint \
  ...
  --c_thought_visual 4 --max_visual_tokens 2560
  ```

 
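Since the checkpoint uses the `Qwen3VLForConditionalGeneration` architecture, it can also be loaded with plain Transformers. The snippet below is a minimal, illustrative sketch: the checkpoint path, image, and prompt are placeholders, and OneVL's latent prefill and trajectory parsing are handled by `infer_onevl.py` in the repository rather than by `generate` alone.

```python
# Minimal sketch: load the OneVL checkpoint through the standard Qwen3-VL
# Transformers interface (paths, image, and prompt below are placeholders).
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

model_path = "/path/to/OneVL-checkpoint"  # placeholder
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/or/url/to/front_camera.jpg"},  # placeholder
            {"type": "text", "text": "Describe the driving scene."},         # placeholder prompt
        ],
    }
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
print(
    processor.batch_decode(
        output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )[0]
)
```

For benchmark-formatted inputs, the latent CoT interface, and the visual and language explanations, use `infer_onevl.py` and the scripts shown above.
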
  ## Citation

  ```bibtex
  ...
  }
  ```

  ## License
  Released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
+ Model weights are built on [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) and the visual tokenizer is from [Emu3.5-VisionTokenizer](https://huggingface.co/BAAI/Emu3.5-VisionTokenizer).