ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning
Abstract
ParaVT enables parallel video tool calling through multi-agent reinforcement learning, addressing limitations of sequential approaches and improving long-video understanding performance.
Training large multimodal models (LMMs) via reinforcement learning (RL) to natively invoke video-processing tools (e.g., cropping) has become a promising route to long-video understanding. However, existing native-RL methods dispatch tool calls sequentially (i.e., one per turn): a single wrong crop propagates errors without peer correction, multi-turn tool calls corrupt context, and inference cost scales linearly with the number of turns. We introduce ParaVT, the first multi-agent end-to-end RL-trained framework for Parallel Video Tool calling, dispatching multiple time-window crops in a single turn for cleaner context and better fault tolerance. Yet applying standard RL to ParaVT reveals an obstacle we term the Tool Prior Paradox: the pretrained tool priors that enable tool exploration also destabilize cold-started structural format and expose the skip-tool reward shortcut under temperature sampling. A cross-model contrast on a weaker-prior LMM supports this claim: format stays stable but RL elicits zero tool calls, indicating that prior strength is the shared driver of both format collapse and tool exploration. We propose PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO), which augments standard RL with two complementary mechanisms: (i) a targeted format reward applied only at the structural-token positions most prone to collapse, and (ii) a per-prompt frame-budget randomization that creates training prompts where calling the tool yields a measurable reward signal over skipping it. Across six long-video understanding benchmarks, ParaVT improves over the Qwen3-VL baseline by +7.9% on average, with PARA-GRPO lifting training-time format compliance from 0.13 to 0.64. As tool capabilities become increasingly internalized in modern LMMs, RL must cooperate with the resulting priors, and ParaVT offers a general recipe for agentic RL. Code, data, and model weights are publicly available.
Community
Long-video understanding is becoming agentic: LMMs are post-trained with RL to natively invoke video tools (e.g., temporal cropping). But every existing native-RL recipe (including our own LongVT @ CVPR 2026) dispatches tool calls sequentially, one per turn: a bad crop gets no peer correction, multi-turn calls cause the context to drift, and inference cost grows linearly with the number of turns.
ParaVT is the first multi-agent end-to-end RL-trained framework for Parallel Video Tool calling. A main agent emits multiple temporal-window crops in a single turn, weight-sharing sub-agents process them concurrently, and a gather-and-reason step produces the final answer.
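To make the dispatch pattern concrete, here is a minimal Python sketch of "several windows in one turn, processed concurrently, then gathered". It is not the ParaVT implementation: `Window`, `inspect_window`, `dispatch_parallel`, and `gather` are hypothetical names, and the sub-agent call is a stub where a real system would crop the video and run the shared-weight model on the clip.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """A temporal crop request in seconds (hypothetical structure)."""
    start_s: float
    end_s: float


def inspect_window(window: Window) -> str:
    # Stub for one sub-agent call. A real system would crop the video to this
    # window and query the shared-weight model; here we only return a label.
    return f"observation for [{window.start_s:.0f}s, {window.end_s:.0f}s]"


def dispatch_parallel(windows: list[Window], max_workers: int = 4) -> list[str]:
    # All windows come from a single main-agent turn and run concurrently.
    # Executor.map returns results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(inspect_window, windows))


def gather(observations: list[str]) -> str:
    # Merge sub-agent outputs into one context for the final reasoning step.
    return "\n".join(observations)


windows = [Window(0, 30), Window(120, 150), Window(300, 330)]
observations = dispatch_parallel(windows)
context = gather(observations)
print(context)
```

Because the sub-agent results come back together, one bad crop sits next to its peers in the gathered context instead of being the only evidence carried into the next turn.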
But applying standard GRPO on top of a tool-native LMM surfaces two coupled failures driven by the same pretrained tool prior. We call this the Tool Prior Paradox:
- Format Fragility: SFT-learned <think> / <tool_call> / <answer> closures collapse under temperature sampling.
- Tool Necessity Gap: with a 64-frame overview, "skip-tool" becomes a shortcut and the GRPO advantage of calling vs. skipping flattens to zero.
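The flattening in the Tool Necessity Gap follows from how GRPO scores rollouts relative to their group. The sketch below uses the standard group-normalized advantage (reward minus group mean, divided by group standard deviation); the reward values are made up for illustration, and the `eps` term and use of population standard deviation are assumptions rather than details taken from the paper.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    # GRPO-style advantage: normalize each rollout's reward within its group.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Illustrative group where the overview alone suffices: rollouts that call the
# tool and rollouts that skip it all answer correctly and earn the same reward.
saturated = group_advantages([1.0, 1.0, 1.0, 1.0])

# Illustrative group where only the tool-calling rollouts (first two) succeed.
separated = group_advantages([1.0, 1.0, 0.0, 0.0])

print(saturated)
print(separated)
```

In the first group every advantage is exactly zero, so the policy gradient carries no preference for calling over skipping. Only when the rewards separate, as in the second group, does tool use receive a positive signal.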
We propose PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO), pairing one targeted fix per failure: (i) a format reward applied only at the structural-token positions most prone to collapse, and (ii) per-prompt overview-frame randomization K ~ Uniform{4, 8, 16, 32, 64} that keeps the tool-call advantage non-degenerate.
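A rough Python sketch of the two ingredients, under loudly stated assumptions: the closing-tag spellings, the stack-based parseability check, the `bonus` value, and every function name here are hypothetical and are not taken from the paper or its code. Only the frame-budget set {4, 8, 16, 32, 64} and the idea of rewarding format at structural-token positions come from the text above.

```python
import random

FRAME_BUDGETS = (4, 8, 16, 32, 64)  # overview-frame set from the post
# Assumed tag vocabulary; the closing-tag spellings are a guess.
STRUCTURAL_TOKENS = {
    "<think>", "</think>",
    "<tool_call>", "</tool_call>",
    "<answer>", "</answer>",
}


def sample_overview_frames(rng: random.Random) -> int:
    # Per-prompt randomization: K drawn uniformly from the budget set.
    return rng.choice(FRAME_BUDGETS)


def is_parseable(tokens: list[str]) -> bool:
    # Assumed check: every structural tag is closed in order, none left open.
    stack = []
    for tok in tokens:
        if tok not in STRUCTURAL_TOKENS:
            continue
        if tok.startswith("</"):
            if not stack or stack.pop() != tok[2:-1]:
                return False
        else:
            stack.append(tok[1:-1])
    return not stack


def targeted_format_reward(tokens: list[str], bonus: float = 1.0) -> list[float]:
    # Reward lands only on structural-token positions, and only when the
    # whole response parses; all other positions receive zero.
    scale = bonus if is_parseable(tokens) else 0.0
    return [scale if tok in STRUCTURAL_TOKENS else 0.0 for tok in tokens]


rng = random.Random(0)
k = sample_overview_frames(rng)
good = targeted_format_reward(["<think>", "look", "</think>", "<answer>", "B", "</answer>"])
bad = targeted_format_reward(["<think>", "look", "<answer>", "B"])
print(k, good, bad)
```

With small K the overview is often too sparse to answer from, so skipping the tool is more likely to fail and the call-vs-skip rewards in a group are more likely to separate. The paper's actual reward shaping and gating may differ from this sketch.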
Fully open: paper, code, weights, data
arxiv.org/abs/2605.20342 · github.com/EvolvingLMMs-Lab/ParaVT · https://huggingface.co/ParaVT · evolvinglmms-lab.github.io/ParaVT