ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

Overview

Training large multimodal models (LMMs) via reinforcement learning to natively invoke video-processing tools (such as temporal cropping) has become a promising route to long-video understanding. Existing native-RL methods, however, dispatch tool calls sequentially (one per turn): a single wrong crop propagates errors without peer correction, multi-turn calls corrupt context, and inference cost scales linearly with the number of turns.

ParaVT is the first multi-agent end-to-end RL-trained framework for Parallel Video Tool calling: it dispatches multiple time-window crops in a single turn for cleaner context and better fault tolerance. Applying standard RL to ParaVT surfaces an obstacle we term the Tool Prior Paradox, where the pretrained tool priors that enable tool exploration also destabilize cold-started structural format and expose a skip-tool reward shortcut under temperature sampling. We address this with PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO): a targeted format reward applied only at the structural-token positions most prone to collapse, and a per-prompt frame-budget randomization that creates training prompts where calling the tool yields a measurable reward signal over skipping it.

Model Card

This repository hosts the final post-RL checkpoint (ParaVT-8B), obtained by running PARA-GRPO on top of the cold-start SFT checkpoint mwxely/ParaVT-8B-SFT. The base architecture is Qwen3VLForConditionalGeneration, identical to Qwen/Qwen3-VL-8B-Instruct; only the language-model weights are updated.

Field	Value
Architecture	`Qwen3VLForConditionalGeneration`
Parameters	8 B
Base model	`Qwen/Qwen3-VL-8B-Instruct`
Training stages	Cold-start SFT (500 steps) → PARA-GRPO RL (54 steps)
Training data	`ParaVT/ParaVT-Parquet` (`sft` + `rl` configs)
Source videos	`ParaVT/ParaVT-Source`
Native tool	Temporal cropping (start time, end time, optional sub-frame count)

Usage

ParaVT-8B is a drop-in transformers / vllm model for video-text-to-text. The full evaluation driver, prompt templates, and reproduction scripts live in the ParaVT GitHub repository; please refer to it for the exact environment that produced the reported numbers.

# Reproduce the headline numbers (after installing the eval venv)
git clone https://github.com/EvolvingLMMs-Lab/ParaVT.git && cd ParaVT
cp .secrets.env.example .secrets.env && $EDITOR .secrets.env
bash scripts/setup_env.sh eval
PARAVT_EVAL_MODEL=ParaVT/ParaVT-8B \
    bash paravt/eval/scripts/reproduce_paravt_8b.sh

For inference outside the eval driver, treat the model exactly like Qwen/Qwen3-VL-8B-Instruct: vLLM --model ParaVT/ParaVT-8B, the same tokenizer, the same chat template. The agentic system prompt and the tool schema used during PARA-GRPO are documented in paravt/eval/configs/withtool.yaml and paravt/eval/utils.py.

Citation

If you find ParaVT useful for your research and applications, please cite:

@article{yang2026paravt,
  title={ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning},
  author={Yang, Zuhao and Zhang, Kaichen and Wang, Sudong and Wu, Keming and Yang, Zhongyu and Li, Bo and Qi, Xiaojuan and Lu, Shijian and Li, Xingxuan and Bing, Lidong},
  journal={arXiv preprint arXiv:2605.20342},
  year={2026}
}

Acknowledgements

ParaVT builds on the LongVT (CVPR 2026) framework for native video tool calling, the lmms-engine cold-start SFT infrastructure, the AReaL RL training stack, and the lmms-eval evaluation harness. We thank the maintainers of all of the above.