Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks
Abstract
Ref-Adv is a challenging referring expression comprehension benchmark that eliminates shortcut solutions by pairing linguistically complex expressions, containing only the information needed to uniquely identify the target, with hard visual distractors, exposing limitations in the visual reasoning of current multimodal LLMs.
Referring Expression Comprehension (REC) links language to region-level visual perception. Performance on standard benchmarks (RefCOCO, RefCOCO+, RefCOCOg) has improved rapidly with multimodal LLMs, yet these benchmarks remain weak tests of visual reasoning and grounding: (i) many expressions are very short, leaving little reasoning demand; (ii) images often contain few distractors, making the target easy to find; and (iii) redundant descriptors enable shortcut solutions that bypass genuine text understanding and visual reasoning. We introduce Ref-Adv, a modern REC benchmark that suppresses shortcuts by pairing linguistically nontrivial expressions with only the information necessary to uniquely identify the target. The dataset contains referring expressions on real images, curated with hard distractors and annotated with reasoning facets including negation. We conduct comprehensive ablations (word-order perturbations and descriptor-deletion sufficiency tests) to show that solving Ref-Adv requires reasoning beyond simple cues, and we evaluate a broad suite of contemporary multimodal LLMs on Ref-Adv. Despite strong results on RefCOCO, RefCOCO+, and RefCOCOg, models drop markedly on Ref-Adv, revealing reliance on shortcuts and gaps in visual reasoning and grounding. We provide an in-depth failure analysis and hope Ref-Adv will guide future work on visual reasoning and grounding in MLLMs.
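To make the two ablations concrete, here is a minimal, hypothetical sketch, assuming referring expressions are plain strings and descriptors are comma-separated phrases; the paper's actual splitting and scoring protocol may differ:

```python
import random

def shuffle_word_order(expression: str, seed: int = 0) -> str:
    """Randomly permute the words of a referring expression.

    If a model scores the same on shuffled inputs, it is likely
    relying on bag-of-words cues rather than on syntax.
    """
    words = expression.split()
    rng = random.Random(seed)
    rng.shuffle(words)
    return " ".join(words)

def delete_descriptor(expression: str, index: int) -> str:
    """Drop one comma-separated descriptor from the expression.

    If the model still localizes the target, that descriptor was
    redundant -- exactly the kind of shortcut Ref-Adv removes.
    """
    parts = [p.strip() for p in expression.split(",")]
    kept = [p for i, p in enumerate(parts) if i != index]
    return ", ".join(kept)

# Hypothetical expression, for illustration only
expr = "the mug that is not red, left of the laptop, behind the stack of books"
print(shuffle_word_order(expr))
print(delete_descriptor(expr, index=1))
```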
Community
Ref-Adv: a modern referring expression comprehension benchmark that suppresses the shortcuts found in standard benchmarks by pairing complex expressions with hard visual distractors. We release Ref-Adv-s (1,142 cases) together with evaluation code and prediction files for the Qwen 2.5–3.5 VL series.
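The released evaluation code is not reproduced here, but the standard REC metric it presumably computes, accuracy at an IoU threshold between predicted and ground-truth boxes, can be sketched as follows. The (x1, y1, x2, y2) box format and the 0.5 threshold are assumptions, not confirmed details of the official scorer:

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2); assumed format

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def accuracy_at_iou(preds: List[Box], gts: List[Box], thresh: float = 0.5) -> float:
    """Fraction of predictions whose IoU with ground truth meets the threshold."""
    hits = sum(iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)

# Hypothetical usage on two toy examples
preds = [(10, 10, 50, 50), (0, 0, 20, 20)]
gts = [(12, 8, 48, 52), (30, 30, 60, 60)]
print(f"Acc@0.5 = {accuracy_at_iou(preds, gts):.2f}")
```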
Related papers recommended by the Semantic Scholar API:
- MIRROR: Multimodal Iterative Reasoning via Reflection on Visual Regions (2026)
- RegionReasoner: Region-Grounded Multi-Round Visual Reasoning (2026)
- Seeing Is Believing? A Benchmark for Multimodal Large Language Models on Visual Illusions and Anomalies (2026)
- Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models (2026)
- GeM-VG: Towards Generalized Multi-image Visual Grounding with Multimodal Large Language Models (2026)
- Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions (2026)
- Video-MSR: Benchmarking Multi-hop Spatial Reasoning Capabilities of MLLMs (2026)