SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise
Abstract
Spoken query retrieval is an important interaction mode in modern information retrieval. However, existing evaluation datasets are often limited to simple queries under constrained noise conditions, making them inadequate for assessing the robustness of spoken query retrieval systems under complex acoustic perturbations. To address this limitation, we present SQuTR, a robustness benchmark for spoken query retrieval that comprises a large-scale dataset and a unified evaluation protocol. SQuTR aggregates 37,317 unique queries from six commonly used English and Chinese text retrieval datasets, spanning multiple domains and diverse query types. We synthesize speech using voice profiles from 200 real speakers and mix in 17 categories of real-world environmental noise at controlled SNR levels, enabling reproducible robustness evaluation from quiet to highly noisy conditions. Under the unified protocol, we conduct large-scale evaluations of representative cascaded and end-to-end retrieval systems. Experimental results show that retrieval performance degrades as noise increases, though the magnitude of the drop varies substantially across systems. Even large-scale retrieval models struggle under extreme noise, indicating that robustness remains a critical bottleneck. Overall, SQuTR provides a reproducible testbed for benchmarking and diagnostic analysis, and facilitates future research on robustness in spoken query-to-text retrieval.
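The SNR-controlled mixing described in the abstract reduces to scaling a noise waveform before adding it to the clean speech. Below is a minimal sketch in Python/NumPy, assuming mono waveforms at a shared sample rate; the function name and epsilon guard are illustrative, not the benchmark's released tooling.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `speech` at a target signal-to-noise ratio in dB."""
    # Tile or trim the noise clip so it covers the whole utterance.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    # Average power (mean squared amplitude) of each signal.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # guard against silent noise

    # Choose a gain so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```

Lowering `snr_db` makes the noise louder relative to the speech, which is how a benchmark like this can sweep conditions from quiet to highly noisy while keeping everything else fixed.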
Community
SQuTR (Spoken Query-to-Text Retrieval) is a large-scale bilingual benchmark designed to evaluate the robustness of information retrieval (IR) systems under realistic and complex acoustic perturbations.
While speech has become a primary interface for IR, performance often degrades significantly in noisy environments. SQuTR addresses this by extending 6 popular text retrieval datasets into the spoken domain, providing 37,317 complex queries across 6 domains, synthesized with voice profiles from 200 real speakers and evaluated under 4 graded noise levels.
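The graded-noise evaluation described above is, in effect, a loop that scores the same query set under each acoustic condition so that degradation curves are directly comparable across systems. Here is a minimal sketch under assumed interfaces; the `System` and `Metric` callables and the condition layout are hypothetical placeholders, not the benchmark's actual API.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces: a system maps one audio file to a ranked list of
# document ids, and a metric compares rankings against relevance judgments.
System = Callable[[str], List[str]]
Metric = Callable[[Dict[str, List[str]], Dict[str, List[str]]], float]

def evaluate_under_noise(
    system: System,
    metric: Metric,
    audio_by_condition: Dict[str, Dict[str, str]],  # condition -> query_id -> wav path
    qrels: Dict[str, List[str]],                    # query_id -> relevant doc ids
) -> Dict[str, float]:
    """Score one system under every noise condition (e.g. quiet through severe)."""
    scores = {}
    for condition, audio_paths in audio_by_condition.items():
        rankings = {qid: system(path) for qid, path in audio_paths.items()}
        scores[condition] = metric(rankings, qrels)
    return scores
```

Both cascaded systems (ASR followed by text retrieval) and end-to-end systems fit this signature, since each ultimately maps a spoken query to a document ranking.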
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- LEMAS: A 150K-Hour Large-scale Extensible Multilingual Audio Suite with Generative Speech Models (2026)
- Pardon? Evaluating Conversational Repair in Large Audio-Language Models (2026)
- AmharicIR+Instr: A Two-Dataset Resource for Neural Retrieval and Instruction Tuning (2026)
- Timbre-Aware LLM-based Direct Speech-to-Speech Translation Extendable to Multiple Language Pairs (2026)
- MENASpeechBank: A Reference Voice Bank with Persona-Conditioned Multi-Turn Conversations for AudioLLMs (2026)
- Covo-Audio Technical Report (2026)
- A Semantically Consistent Dataset for Data-Efficient Query-Based Universal Sound Separation (2026)