- ClawBench: Can AI Agents Complete Everyday Online Tasks?
  Paper • 2604.08523 • Published • 253
- Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents
  Paper • 2604.06132 • Published • 114
- FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios
  Paper • 2604.07413 • Published • 89
- GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers
  Paper • 2604.02648 • Published • 45
bogeumkim
AI & ML interests
NLP
Recent Activity
updated a collection 3 days ago: eval-papers-collection
updated a collection 3 days ago: eval-papers-collection
updated a collection 3 days ago: eval-papers-collection