A collection of benchmarks for evaluating LMs or VLMs under multi-turn interaction
Young-Jun Lee PRO
passing2961
AI & ML interests
Social Dialogue System, Multi-Modal Dialogue
Recent Activity
Upvoted a paper about 2 hours ago
MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents
Upvoted a paper about 2 hours ago
PROGRESSLM: Towards Progress Reasoning in Vision-Language Models
Upvoted a paper about 2 hours ago
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces