TVIR: Building Deep Research Agents Towards Text-Visual Interleaved Report Generation
Abstract
A multimodal deep research benchmark and agent framework are introduced to evaluate and improve the factual reliability and visual alignment of automated report generation systems.
Deep Research Agents have shown strong capability in multi-step information retrieval, reasoning, and long-form report generation, but existing benchmarks and systems remain predominantly text-centric, with limited evaluation of whether visual elements are factually reliable and well aligned with the surrounding analysis. To address this gap, we introduce TVIR (Text-Visual Interleaved Report Generation), which includes TVIR-Bench, a benchmark of 100 expert-curated multimodal deep research tasks that require visual elements to serve specific analytical sub-goals, and TVIR-Agent, a hierarchical multi-agent framework that serves as a strong baseline for constructing outlines, retrieving images, generating charts with traceable sources, and composing reports through context-aware sequential writing. We further develop a dual-path evaluation framework that combines Textual Assessment and Visual Assessment. Experiments across nine deep research systems show that TVIR-Agent achieves strong overall performance, underscoring the importance of explicit multimodal design and evaluation for evidence-driven report generation.
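The abstract does not say how context-aware sequential writing is implemented. As a rough illustration only, the sketch below assumes it means drafting sections in outline order while handing the writer every previously drafted section as context. The names `Section`, `write_report`, and `stub_writer` are hypothetical and are not part of TVIR-Agent; the stub stands in for a model call, and the example URL is a placeholder.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Section:
    """One outline entry (hypothetical structure, not taken from the paper)."""
    title: str
    # (caption, source) pairs, so every visual keeps a traceable source.
    visuals: List[Tuple[str, str]] = field(default_factory=list)


# A writer receives the section to draft plus all previously drafted sections.
Writer = Callable[[Section, List[str]], str]


def stub_writer(section: Section, context: List[str]) -> str:
    """Stand-in for a model call: reports how much context it was given."""
    lines = [
        f"## {section.title}",
        f"(written with {len(context)} earlier section(s) as context)",
    ]
    for caption, source in section.visuals:
        lines.append(f"[Figure: {caption} | source: {source}]")
    return "\n".join(lines)


def write_report(outline: List[Section], writer: Writer) -> str:
    """Draft sections in order, passing earlier drafts to each writer call."""
    drafted: List[str] = []
    for section in outline:
        # Pass a copy so the writer cannot mutate the running draft list.
        drafted.append(writer(section, list(drafted)))
    return "\n\n".join(drafted)


outline = [
    Section("Background"),
    Section("Market trends", visuals=[("Revenue by year", "https://example.com/data")]),
]
report = write_report(outline, stub_writer)
print(report)
```

Under this assumption, each section sees only what was written before it, which is one simple way to keep later analysis consistent with earlier text and figures. A real system would presumably replace `stub_writer` with a model-backed writer.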
Community
We present TVIR, the first benchmark and agent framework specifically designed for text-visual interleaved report generation. Unlike existing text-only deep research systems, TVIR-Bench evaluates both textual quality and visual integration across 100 expert-curated tasks. Our TVIR-Agent achieves state-of-the-art performance, demonstrating that structured multi-agent collaboration is key to generating high-quality multimodal reports.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation (2026)
- ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence (2026)
- MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory (2026)
- Deep-Reporter: Deep Research for Grounded Multimodal Long-Form Generation (2026)
- Towards Knowledgeable Deep Research: Framework and Benchmark (2026)
- HiCrew: Hierarchical Reasoning for Long-Form Video Understanding via Question-Aware Multi-Agent Collaboration (2026)
- Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning (2026)
Get this paper in your agent:
hf papers read 2606.02320
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash