ORCA: Open-ended Response Correctness Assessment for Audio Question Answering
Abstract
ORCA is a model-based approach for assessing large audio language models that achieves high correlation with human ratings and identifies problematic benchmark items through variance prediction.
Reliable assessment of the abilities of large audio language models (LALMs) is essential to advancing the state of the art. As benchmarks rapidly evolve to incorporate complex reasoning and subjective tasks, they increasingly require open-ended responses from LALMs. We present Open-ended Response Correctness Assessment (ORCA), a reliable and lightweight model-based approach for answer correctness and disagreement modeling. We employ a three-stage annotation pipeline combining human judgment, structured feedback, and human-AI correction, yielding 9,663 annotations across 3,699 question-answer pairs from 15 LALMs on three audio understanding and reasoning benchmarks, with an inter-annotator agreement of 0.82 (Krippendorff's alpha). Our experiments employing curriculum learning show that ORCA models achieve a Spearman correlation of 0.91 with average human correctness ratings on seen benchmarks and generalize to unseen benchmarks with a correlation of 0.85, outperforming several LLM judge baselines including Gemini 2.5 Flash. Furthermore, we demonstrate that ORCA's predicted variance correlates strongly with human disagreement, allowing it to effectively identify problematic benchmark items.
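The two correlations reported in the abstract can be illustrated with a small sketch. It is not the authors' evaluation code: the ratings, the binary rating scale, and the variable names (`human_ratings`, `predicted_mean`, `predicted_var`) are hypothetical, chosen only to show how a Spearman correlation might be computed between predicted correctness and average human ratings, and between predicted variance and the variance of human ratings.

```python
from statistics import mean, pvariance


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data (NOT from the paper): three annotators rate each of
# four answers as correct (1) or incorrect (0).
human_ratings = [[1, 1, 1], [0, 1, 0], [0, 0, 0], [1, 0, 1]]
# Hypothetical model outputs: predicted correctness and predicted variance.
predicted_mean = [0.95, 0.40, 0.05, 0.70]
predicted_var = [0.02, 0.22, 0.03, 0.20]

human_mean = [mean(r) for r in human_ratings]      # average human rating
human_var = [pvariance(r) for r in human_ratings]  # human disagreement

rho_correctness = spearman(predicted_mean, human_mean)
rho_disagreement = spearman(predicted_var, human_var)
print(rho_correctness, rho_disagreement)
```

In this reading, items with high predicted variance would be the ones to inspect as potentially problematic; how the paper actually thresholds or ranks such items is not stated in the abstract.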
Models citing this paper 3
BUT-FIT/orca-llama-3.2-3b-it-multinomial
Datasets citing this paper 1
BUT-FIT/orca-audio-qa-annotations