HDINO: A Concise and Efficient Open-Vocabulary Detector
Abstract
HDINO is an efficient open-vocabulary object detector that uses a two-stage training strategy with semantic alignment and lightweight feature fusion to achieve high performance without manual data curation.
Despite the growing interest in open-vocabulary object detection in recent years, most existing methods rely heavily on manually curated fine-grained training datasets as well as resource-intensive layer-wise cross-modal feature extraction. In this paper, we propose HDINO, a concise yet efficient open-vocabulary object detector that eliminates the dependence on these components. Specifically, we propose a two-stage training strategy built upon the transformer-based DINO model. In the first stage, noisy samples are treated as additional positive object instances to construct a One-to-Many Semantic Alignment Mechanism (O2M) between the visual and textual modalities, thereby facilitating semantic alignment. A Difficulty Weighted Classification Loss (DWCL) is also designed based on initial detection difficulty to mine hard examples and further improve model performance. In the second stage, a lightweight feature fusion module is applied to the aligned representations to enhance sensitivity to linguistic semantics. Under the Swin Transformer-T setting, HDINO-T achieves 49.2 mAP on COCO using 2.2M training images from two publicly available detection datasets, without any manual data curation or grounding data, surpassing Grounding DINO-T and T-Rex2 (trained on 5.4M and 6.5M images, respectively) by 0.8 mAP and 2.8 mAP. After fine-tuning on COCO, HDINO-T and HDINO-L further achieve 56.4 mAP and 59.2 mAP, highlighting the effectiveness and scalability of our approach. Code and models are available at https://github.com/HaoZ416/HDINO.
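The abstract does not give the exact form of the Difficulty Weighted Classification Loss, so the following is only a plausible sketch: a focal-style binary cross-entropy in which examples the initial detector finds hard (low predicted probability for the true class) receive larger weights. The function name `dwcl` and the `gamma` exponent are illustrative assumptions, not the paper's formulation.

```python
import math

def dwcl(probs, labels, gamma=2.0):
    """Illustrative difficulty-weighted binary cross-entropy.

    Assumed sketch (not the paper's exact DWCL): each example is
    weighted by (1 - p_t)^gamma, where p_t is the predicted
    probability of the true class, so hard examples dominate the loss.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        pt = p if y == 1 else 1.0 - p      # probability of the true class
        weight = (1.0 - pt) ** gamma       # difficulty weight: hard -> large
        total += -weight * math.log(max(pt, 1e-12))
    return total / len(probs)
```

Under this weighting, a confidently correct prediction contributes almost nothing, while a hard positive contributes a much larger term, which is the hard-example-mining behavior the abstract describes.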
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Beyond Open Vocabulary: Multimodal Prompting for Object Detection in Remote Sensing Images (2026)
- Enhancing Open-Vocabulary Object Detection through Multi-Level Fine-Grained Visual-Language Alignment (2026)
- ExpAlign: Expectation-Guided Vision-Language Alignment for Open-Vocabulary Grounding (2026)
- Learning Multi-Modal Prototypes for Cross-Domain Few-Shot Object Detection (2026)
- LoGoSeg: Integrating Local and Global Features for Open-Vocabulary Semantic Segmentation (2026)
- Integrating Diverse Assignment Strategies into DETRs (2026)
- Fine-tuning Pre-trained Vision-Language Models in a Human-Annotation-Free Manner (2026)