Instructions to use ParaVT/ParaVT-8B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use ParaVT/ParaVT-8B with Transformers:
# Load model directly from transformers import AutoProcessor, AutoModelForImageTextToText processor = AutoProcessor.from_pretrained("ParaVT/ParaVT-8B") model = AutoModelForImageTextToText.from_pretrained("ParaVT/ParaVT-8B") - Notebooks
- Google Colab
- Kaggle
ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning
Overview
Training large multimodal models (LMMs) via reinforcement learning to natively invoke video-processing tools (such as temporal cropping) has become a promising route to long-video understanding. Existing native-RL methods, however, dispatch tool calls sequentially (one per turn): a single wrong crop propagates errors without peer correction, multi-turn calls corrupt context, and inference cost scales linearly with the number of turns.
ParaVT is the first multi-agent end-to-end RL-trained framework for Parallel Video Tool calling: it dispatches multiple time-window crops in a single turn for cleaner context and better fault tolerance. Applying standard RL to ParaVT surfaces an obstacle we term the Tool Prior Paradox, where the pretrained tool priors that enable tool exploration also destabilize cold-started structural format and expose a skip-tool reward shortcut under temperature sampling. We address this with PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO): a targeted format reward applied only at the structural-token positions most prone to collapse, and a per-prompt frame-budget randomization that creates training prompts where calling the tool yields a measurable reward signal over skipping it.
Model Card
This repository hosts the final post-RL checkpoint (ParaVT-8B), obtained by running PARA-GRPO on top of the cold-start SFT checkpoint mwxely/ParaVT-8B-SFT. The base architecture is Qwen3VLForConditionalGeneration, identical to Qwen/Qwen3-VL-8B-Instruct; only the language-model weights are updated.
| Field | Value |
|---|---|
| Architecture | Qwen3VLForConditionalGeneration |
| Parameters | 8 B |
| Base model | Qwen/Qwen3-VL-8B-Instruct |
| Training stages | Cold-start SFT (500 steps) → PARA-GRPO RL (54 steps) |
| Training data | ParaVT/ParaVT-Parquet (sft + rl configs) |
| Source videos | ParaVT/ParaVT-Source |
| Native tool | Temporal cropping (start time, end time, optional sub-frame count) |
Usage
ParaVT-8B is a drop-in transformers / vllm model for video-text-to-text. The full evaluation driver, prompt templates, and reproduction scripts live in the ParaVT GitHub repository; please refer to it for the exact environment that produced the reported numbers.
# Reproduce the headline numbers (after installing the eval venv)
git clone https://github.com/EvolvingLMMs-Lab/ParaVT.git && cd ParaVT
cp .secrets.env.example .secrets.env && $EDITOR .secrets.env
bash scripts/setup_env.sh eval
PARAVT_EVAL_MODEL=ParaVT/ParaVT-8B \
bash paravt/eval/scripts/reproduce_paravt_8b.sh
For inference outside the eval driver, treat the model exactly like Qwen/Qwen3-VL-8B-Instruct: vLLM --model ParaVT/ParaVT-8B, the same tokenizer, the same chat template. The agentic system prompt and the tool schema used during PARA-GRPO are documented in paravt/eval/configs/withtool.yaml and paravt/eval/utils.py.
Citation
If you find ParaVT useful for your research and applications, please cite:
@misc{yang2026paravt,
title={{ParaVT}: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning},
author={Zuhao Yang and Kaichen Zhang and Sudong Wang and Keming Wu and Zhongyu Yang and Bo Li and Xiaojuan Qi and Shijian Lu and Xingxuan Li and Lidong Bing},
year={2026},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Acknowledgements
ParaVT builds on the LongVT (CVPR 2026) framework for native video tool calling, the lmms-engine cold-start SFT infrastructure, the AReaL RL training stack, and the lmms-eval evaluation harness. We thank the maintainers of all of the above.
- Downloads last month
- 31