Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks
Abstract
Ref-Adv is a challenging referring expression comprehension benchmark that eliminates shortcut solutions by pairing linguistically complex expressions, containing only the information needed to uniquely identify the target, with hard visual distractors, exposing limitations in the visual reasoning of current multimodal LLMs.
Referring Expression Comprehension (REC) links language to region-level visual perception. Performance on standard benchmarks (RefCOCO, RefCOCO+, RefCOCOg) has improved rapidly with multimodal LLMs, yet these benchmarks remain weak tests of visual reasoning and grounding: (i) many expressions are very short, leaving little reasoning demand; (ii) images often contain few distractors, making the target easy to find; and (iii) redundant descriptors enable shortcut solutions that bypass genuine text understanding and visual reasoning. We introduce Ref-Adv, a modern REC benchmark that suppresses shortcuts by pairing linguistically nontrivial expressions with only the information necessary to uniquely identify the target. The dataset contains referring expressions on real images, curated with hard distractors and annotated with reasoning facets including negation. We conduct comprehensive ablations (word-order perturbations and descriptor-deletion sufficiency tests) to show that solving Ref-Adv requires reasoning beyond simple cues, and we evaluate a broad suite of contemporary multimodal LLMs on Ref-Adv. Despite strong results on RefCOCO, RefCOCO+, and RefCOCOg, models drop markedly on Ref-Adv, revealing reliance on shortcuts and gaps in visual reasoning and grounding. We provide an in-depth failure analysis and hope Ref-Adv will guide future work on visual reasoning and grounding in MLLMs.
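To make the two ablations concrete, here is a minimal, hypothetical sketch, assuming referring expressions are plain strings and descriptors are comma-separated phrases; the paper's actual splitting and scoring protocol may differ:

```python
import random

def shuffle_word_order(expression: str, seed: int = 0) -> str:
    """Randomly permute the words of a referring expression.

    If a model scores the same on shuffled inputs, it is likely
    relying on bag-of-words cues rather than on syntax.
    """
    words = expression.split()
    rng = random.Random(seed)
    rng.shuffle(words)
    return " ".join(words)

def delete_descriptor(expression: str, index: int) -> str:
    """Drop one comma-separated descriptor from the expression.

    If the model still localizes the target, that descriptor was
    redundant -- exactly the kind of shortcut Ref-Adv removes.
    """
    parts = [p.strip() for p in expression.split(",")]
    kept = [p for i, p in enumerate(parts) if i != index]
    return ", ".join(kept)

# Hypothetical expression, for illustration only
expr = "the mug that is not red, left of the laptop, behind the stack of books"
print(shuffle_word_order(expr))
print(delete_descriptor(expr, index=1))
```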
Community
Ref-Adv: a modern referring expression comprehension benchmark that suppresses the shortcuts found in standard benchmarks by pairing complex expressions with hard visual distractors. We release Ref-Adv-s (1,142 cases) together with evaluation code and prediction files for the Qwen 2.5–3.5 VL series.
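The released evaluation code is not reproduced here, but the standard REC metric it presumably computes, accuracy at an IoU threshold between predicted and ground-truth boxes, can be sketched as follows. The (x1, y1, x2, y2) box format and the 0.5 threshold are assumptions, not confirmed details of the official scorer:

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2); assumed format

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def accuracy_at_iou(preds: List[Box], gts: List[Box], thresh: float = 0.5) -> float:
    """Fraction of predictions whose IoU with ground truth meets the threshold."""
    hits = sum(iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)

# Hypothetical usage on two toy examples
preds = [(10, 10, 50, 50), (0, 0, 20, 20)]
gts = [(12, 8, 48, 52), (30, 30, 60, 60)]
print(f"Acc@0.5 = {accuracy_at_iou(preds, gts):.2f}")
```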
Related papers recommended by the Semantic Scholar API:
- MIRROR: Multimodal Iterative Reasoning via Reflection on Visual Regions (2026)
- RegionReasoner: Region-Grounded Multi-Round Visual Reasoning (2026)
- Seeing Is Believing? A Benchmark for Multimodal Large Language Models on Visual Illusions and Anomalies (2026)
- Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models (2026)
- GeM-VG: Towards Generalized Multi-image Visual Grounding with Multimodal Large Language Models (2026)
- Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions (2026)
- Video-MSR: Benchmarking Multi-hop Spatial Reasoning Capabilities of MLLMs (2026)