ClawBench — Browser Agent Benchmark Suite Collection Benchmark dataset (V1+V2), live leaderboard Space, and full V1 execution traces — everything you need to run, regrade, or compare on ClawBench. • 5 items • Updated 21 days ago • 1
Dr. Bench: A Multidimensional Evaluation for Deep Research Agents, from Answers to Reports Paper • 2510.02190 • Published Jan 29 • 20
Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction Paper • 2605.05242 • Published 30 days ago • 119
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling Paper • 2604.28185 • Published Apr 30 • 90
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation Paper • 2604.24763 • Published Apr 27 • 71
VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search Paper • 2503.10582 • Published Mar 13, 2025 • 25
Watch Before You Answer: Learning from Visually Grounded Post-Training Paper • 2604.05117 • Published Apr 6 • 36
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding Paper • 2604.05015 • Published Apr 6 • 236
AI Paper of the Day Collection A collection of papers that I think are interesting, one added each day • 640 items • Updated 7 days ago • 97