Towards Seamless Adaptation of Pre-trained Models for Visual Place Recognition

Authors: Feng Lu, Lijun Zhang, Xiangyuan Lan, Shuting Dong, Yaowei Wang, Chun Yuan

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that our method outperforms the state-of-the-art methods with less training data and training time, and uses only about 3% of the retrieval runtime of two-stage VPR methods with RANSAC-based spatial verification. It ranks 1st on the MSLS challenge leaderboard (at the time of submission).
Researcher Affiliation | Collaboration | Feng Lu1,2, Lijun Zhang3, Xiangyuan Lan2, Shuting Dong1, Yaowei Wang2, Chun Yuan1. 1Tsinghua Shenzhen International Graduate School, Tsinghua University; 2Peng Cheng Laboratory; 3University of Chinese Academy of Sciences
Pseudocode | No | The paper describes methods in text and mathematical formulas but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code is released at https://github.com/Lu-Feng/SelaVPR.
Open Datasets | Yes | Several VPR benchmark datasets, mainly including Tokyo24/7, MSLS, and Pitts30k, are used in our experiments. Table 1 summarizes their main information. Tokyo24/7 (Torii et al., 2015)... Mapillary Street-Level Sequences (MSLS) (Warburg et al., 2020)... Pittsburgh (Pitts30k) (Torii et al., 2013)...
Dataset Splits | Yes | We assess models on both MSLS-val and MSLS-challenge (an online test set without released labels) sets. Pittsburgh (Pitts30k) (Torii et al., 2013) contains 30k reference images and 24k query images in the train, val, and test sets... When the R@5 on the validation set does not have improvement within 3 epochs, the training is terminated. [stopping rule sketched after this table]
Hardware Specification | Yes | We use the DINOv2 based on ViT-L/14 as the foundation model and conduct all experiments on an NVIDIA GeForce RTX 3090 GPU using PyTorch. [backbone loading sketched after this table]
Software Dependencies | No | The paper mentions "PyTorch" as software used, but does not specify a version number or list other software components with their versions.
Experiment Setup | Yes | Fed a 224×224 image, the model produces a 1024-dim global feature and a dense grid of 128-dim local features. The bottleneck ratio of the adapters in ViT blocks is 0.5 and the scaling factor s in Eq. 4 is set to 0.2. We use 3×3 up-conv with stride=2 and padding=1 in the local adaptation module. The output channels of the first and second up-conv layers are 256 and 128, respectively. Following other two-stage methods, we rerank the top-100 candidates to yield final results. We train our models using the Adam optimizer with the learning rate set as 0.00001 and batch size set as 4. When the R@5 on the validation set does not have improvement within 3 epochs, the training is terminated. For MSLS, we set an epoch as passing 30k queries, whereas for Pitts30k it is passing 5k queries. In model training, we define the potential positive images as the reference images within 10 meters of the query image, while the definite negative images are those further than 25 meters. Two hard negative images from 1000 randomly chosen definite negatives are used in the triplet loss. We empirically set the margin m = 0.1 in Eq. 9 and the weight λ = 1 in Eq. 11. [adapter and local adaptation sketched after this table]
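The quoted settings pin down several components concretely. The sketches below are illustrative reconstructions, not the authors' code: any name, nonlinearity, or helper that the quotes do not specify is an assumption and is flagged in a comment.

First, the stopping rule quoted in the Dataset Splits and Experiment Setup rows, with hypothetical train_one_epoch and recall_at_5 callables standing in for the actual training and evaluation routines:

```python
# Minimal sketch of the quoted stopping rule: terminate when R@5 on the
# validation set has not improved for 3 consecutive epochs.
# train_one_epoch and recall_at_5 are hypothetical placeholders, not
# functions from the SelaVPR codebase.
def train_with_early_stopping(model, train_one_epoch, recall_at_5,
                              patience=3, max_epochs=100):
    best_r5, stale = 0.0, 0
    for _ in range(max_epochs):
        train_one_epoch(model)
        r5 = recall_at_5(model)       # R@5 on the validation set
        if r5 > best_r5:
            best_r5, stale = r5, 0
        else:
            stale += 1
            if stale >= patience:     # no improvement within 3 epochs
                break
    return best_r5
```

Second, loading the DINOv2 ViT-L/14 backbone named in the Hardware Specification row. The torch.hub entry point below is DINOv2's published distribution channel, though the paper does not state how the authors load the weights:

```python
import torch

# Downloads facebookresearch/dinov2 weights on first use (needs internet).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
backbone.eval()

# ViT-L/14 on a 224x224 input yields a 16x16 grid of patch tokens
# (224 / 14 = 16) plus a 1024-dim class token.
with torch.no_grad():
    feats = backbone.forward_features(torch.randn(1, 3, 224, 224))
print(feats["x_norm_clstoken"].shape)     # torch.Size([1, 1024])
print(feats["x_norm_patchtokens"].shape)  # torch.Size([1, 256, 1024])
```

Third, the adapter and local adaptation module implied by the Experiment Setup hyperparameters. The nonlinearities are assumptions (the quote names none), and the class and layer names are illustrative, not the authors' identifiers:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter for a ViT block: ratio 0.5, residual scale s = 0.2 (Eq. 4)."""
    def __init__(self, dim=1024, ratio=0.5, s=0.2):
        super().__init__()
        hidden = int(dim * ratio)           # 512 for ViT-L
        self.down = nn.Linear(dim, hidden)
        self.act = nn.GELU()                # assumption: GELU nonlinearity
        self.up = nn.Linear(hidden, dim)
        self.s = s

    def forward(self, x):                   # x: (B, N, 1024) token sequence
        return x + self.s * self.up(self.act(self.down(x)))

class LocalAdaptation(nn.Module):
    """Two 3x3 up-convs, stride 2, padding 1: 1024 -> 256 -> 128 channels."""
    def __init__(self):
        super().__init__()
        self.upconv1 = nn.ConvTranspose2d(1024, 256, kernel_size=3, stride=2, padding=1)
        self.upconv2 = nn.ConvTranspose2d(256, 128, kernel_size=3, stride=2, padding=1)
        self.act = nn.ReLU(inplace=True)    # assumption: ReLU between up-convs

    def forward(self, x):                    # x: (B, 1024, 16, 16) patch grid
        return self.upconv2(self.act(self.upconv1(x)))

local = LocalAdaptation()(torch.randn(2, 1024, 16, 16))
print(local.shape)  # torch.Size([2, 128, 61, 61]); each up-conv maps n -> 2n - 1
```

The remaining quoted settings map directly onto standard PyTorch pieces: Adam with a learning rate of 1e-5 and batch size 4, and a triplet loss with margin 0.1, e.g. torch.nn.TripletMarginLoss(margin=0.1).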