VG-SSL: Benchmarking Self-supervised Representation Learning Approaches for Visual Geo-localization
Abstract
A novel self-supervised learning framework for visual geo-localization that demonstrates superior performance through contrastive learning and information maximization methods.
Visual Geo-localization (VG) is a critical research area for identifying geo-locations from visual inputs, particularly in autonomous navigation for robotics and vehicles. Current VG methods often learn feature extractors from geo-labeled images to create dense, geographically relevant representations. Recent advances in Self-Supervised Learning (SSL) have demonstrated its capability to achieve performance on par with supervised techniques using unlabeled images. This study presents a novel VG-SSL framework, designed for versatile integration and benchmarking of diverse SSL methods for representation learning in VG, featuring a unique geo-related pair strategy, GeoPair. Through extensive performance analysis, we adapt SSL techniques to improve VG on datasets from hand-held and car-mounted cameras used in robotics and autonomous vehicles. Our results show that contrastive learning and information maximization methods yield superior geo-specific representation quality, matching or surpassing the performance of state-of-the-art VG techniques. To our knowledge, this is the first benchmarking study of SSL in VG, highlighting its potential in enhancing geo-specific visual representations for robotics and autonomous vehicles. The code is publicly available at https://github.com/arplaboratory/VG-SSL.
Community
Proposes VG-SSL (Visual Geo-localization with Self-Supervised Learning): benchmarks the adaptation of various SSL methods (SimCLR, MoCo v2, BYOL, SimSiam, Barlow Twins, and VICReg) to VG. SSL can reach the performance of supervised methods without memory-heavy hard negative mining (HNM), since it only requires selecting positive samples.

Summary of methods:
- SimCLR and MoCo v2: contrastive learning with the InfoNCE loss (MoCo adds a momentum encoder, ME).
- SimSiam and BYOL: self-distillation with a stop-gradient (SG) target branch, a predictor (PR) on the online encoder, and batch norm (BN) in the projector/predictor (BYOL also uses an ME), trained with an embedding prediction loss.
- Barlow Twins and VICReg: information maximization with cross-correlation and variance-invariance-covariance regularization losses, respectively (both use BN and large-dimensional projector embeddings, LP).

Training setup: given a database and queries, positives and negatives are drawn from the database and grouped (query-positive groups and identical-negative groups), passed through a trainable feature (embedding) extractor, and optimized with the SSL loss. The InfoNCE loss groups positives together and pushes everything else apart; the embedding prediction loss makes the student, through a shallow MLP predictor, match the non-trainable (stop-gradient) teacher; the Barlow Twins loss builds a cross-correlation (CC) matrix from positive pairs and drives its diagonal toward 1 and its off-diagonal entries toward 0; the VICReg loss combines invariance, variance, and covariance terms, where embeddings must not be batch-normalized before computing the variance term (otherwise it would not reflect the true representations).

GeoPair sampling: since hard negative mining can miss information-worthy negatives, negatives are instead randomly sampled (from all database images that are not positives for a query) at a "database negative ratio" (the fraction of the number of queries sampled per epoch); query-positive pairs and identical-negative pairs are formed for sampling with the same ratio.

Architecture: ResNet-50 as the local feature extractor and NetVLAD as the global feature aggregator, with the different SSL losses swapped in. Results: MoCo v2, BYOL, SimCLR, and Barlow Twins outperform SimSiam and VICReg.
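The InfoNCE loss used by the contrastive methods (SimCLR, MoCo v2) can be sketched as follows. This is a minimal SimCLR-style illustration, not the paper's implementation; the function name and temperature value are assumptions:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """SimCLR-style InfoNCE sketch: pull each query-positive pair together,
    push all other embeddings in the batch apart.
    z1, z2: (N, D) embeddings of the two views (e.g. query and geo-positive)."""
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, D), unit norm
    sim = z @ z.T / temperature                         # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                   # exclude self-similarity
    # the positive of sample i is sample i+N, and vice versa
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)
```

Every non-positive embedding in the batch acts as a negative, which is why random negative sampling (rather than explicit hard negative mining) is enough for these losses.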
Extended ablations in appendix. From NYU.
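The VICReg loss mentioned in the summary (invariance, variance, and covariance terms, with the variance computed on un-normalized embeddings) can be sketched as below. Weights and epsilon follow the common defaults from the VICReg paper and are assumptions here, not values from VG-SSL:

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z1, z2, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """VICReg sketch. z1, z2: (N, D) raw (un-batch-normalized) embeddings;
    normalizing before the variance term would make it meaningless."""
    n, d = z1.shape
    # invariance: MSE between the two views of each geo-pair
    inv = F.mse_loss(z1, z2)
    # variance: hinge keeping each embedding dimension's std above 1
    std1 = torch.sqrt(z1.var(dim=0) + eps)
    std2 = torch.sqrt(z2.var(dim=0) + eps)
    var = F.relu(1.0 - std1).mean() + F.relu(1.0 - std2).mean()
    # covariance: penalize off-diagonal entries of each covariance matrix
    z1c, z2c = z1 - z1.mean(dim=0), z2 - z2.mean(dim=0)
    cov1 = (z1c.T @ z1c) / (n - 1)
    cov2 = (z2c.T @ z2c) / (n - 1)
    off_diag = lambda m: m.pow(2).sum() - m.pow(2).diagonal().sum()
    cov = (off_diag(cov1) + off_diag(cov2)) / d
    return sim_w * inv + var_w * var + cov_w * cov
```

The Barlow Twins loss is structured similarly but operates on a cross-correlation matrix between the two views, pushing its diagonal toward 1 and its off-diagonal toward 0.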
Links: PapersWithCode, GitHub
The GitHub link is invalid. Could you update it?
Hey @QianC95,
I'm unable to find the code implementation of this paper. Looks like the authors still haven't released the code.