SuperVLAD: Compact and Robust Image Descriptors for Visual Place Recognition
Authors: Feng Lu, Xinyao Zhang, Canming Ye, Shuting Dong, Lijun Zhang, Xiangyuan Lan, Chun Yuan
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results suggest that, when paired with a transformer-based backbone, our SuperVLAD shows better domain generalization performance than NetVLAD with significantly fewer parameters. The proposed method also surpasses state-of-the-art methods with lower feature dimensions on several benchmark datasets. |
| Researcher Affiliation | Academia | Feng Lu (1,2), Xinyao Zhang (1), Canming Ye (1), Shuting Dong (1,2), Lijun Zhang (3), Xiangyuan Lan (2,4), Chun Yuan (1). (1) Tsinghua Shenzhen International Graduate School, Tsinghua University; (2) Pengcheng Laboratory; (3) CIGIT, Chinese Academy of Sciences; (4) Pazhou Laboratory (Huangpu) |
| Pseudocode | No | The paper describes its methodology using descriptive text and mathematical equations, but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/lu-feng/SuperVLAD. |
| Open Datasets | Yes | We conduct experiments on several VPR benchmark datasets, which cover various challenges in VPR such as viewpoint changes, condition changes, and perceptual aliasing. Table 1 provides a concise summary of them. Pitts30k [54] mainly exhibits severe viewpoint changes in the urban environment. MSLS [57] is a comprehensive dataset comprising images collected in urban, suburban, and natural scenes over 7 years, and encompasses a wide range of visual changes (viewpoint and condition changes). Nordland [10] exhibits seasonal changes in natural and suburban scenes. SPED [11] consists of low-quality and high-scene-depth images captured from diverse scenes with various condition changes. |
| Dataset Splits | Yes | Pitts30k contains 10k database images in each of the training, validation, and test sets. |
| Hardware Specification | Yes | We implement our experiments on two NVIDIA GeForce RTX 3090 GPUs using PyTorch. |
| Software Dependencies | No | The paper only names PyTorch ("We implement our experiments on two NVIDIA GeForce RTX 3090 GPUs using PyTorch."), without version numbers or a full dependency list. |
| Experiment Setup | Yes | The size of the input image is 224×224 in training (322×322 in inference) and the token dimension of the DINOv2-base backbone is 768. We only fine-tune the last four transformer encoder layers (freezing the previous layers) of the DINOv2 backbone. For the loss function, we utilize the multi-similarity loss [56] and set its hyperparameters as in [1]. The model training is performed using the Adam optimizer with an initial learning rate of 0.00005, halved every 3 epochs. Considering that the cross-image encoder is not initialized, we use a larger learning rate (0.0001) to train it separately. Each training batch consists of 120 places, with 4 images per place, resulting in a total of 480 images. An inference batch consists of 8 images (except for the SPED dataset, where the batch size is 4). The training process is terminated when the performance on MSLS-val does not improve for three epochs. The actual number of effective training epochs is 7, and the training time is 81.6 minutes. A hedged sketch of this setup appears below the table. |
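To make the quoted setup concrete, here is a minimal PyTorch sketch of the training configuration, not the authors' code. It assumes stand-ins throughout: the backbone is approximated by 12 generic transformer encoder layers (DINOv2-base has 12 blocks with token dimension 768), the cross-image encoder and the descriptor output are hypothetical placeholders, and "train it separately" is read here as a separate Adam parameter group with the larger learning rate. The multi-similarity loss comes from the third-party pytorch-metric-learning package; the paper's exact hyperparameters (set "as in [1]") are not reproduced, so library defaults are shown.

```python
# Hedged sketch of the quoted training setup; NOT the authors' implementation.
import torch
from torch import nn, optim
from pytorch_metric_learning import losses  # provides MultiSimilarityLoss

DIM = 768  # token dimension of DINOv2-base

# Stand-in backbone: 12 transformer blocks, as in DINOv2-base (ViT-B/14).
backbone = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=DIM, nhead=12, batch_first=True)
     for _ in range(12)]
)

# Fine-tune only the last four encoder layers; freeze the rest.
for i, block in enumerate(backbone):
    for p in block.parameters():
        p.requires_grad = i >= len(backbone) - 4

# Hypothetical cross-image encoder (randomly initialized, which is why
# the paper trains it with the larger learning rate).
cross_image_encoder = nn.TransformerEncoderLayer(
    d_model=DIM, nhead=8, batch_first=True
)

# Adam: 5e-5 for the fine-tuned backbone layers, 1e-4 for the
# cross-image encoder; both halved every 3 epochs via StepLR.
optimizer = optim.Adam([
    {"params": [p for p in backbone.parameters() if p.requires_grad],
     "lr": 5e-5},
    {"params": cross_image_encoder.parameters(), "lr": 1e-4},
])
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.5)

# Multi-similarity loss; the paper's hyperparameters follow its reference
# [1] and are not reproduced here (library defaults used instead).
criterion = losses.MultiSimilarityLoss()

# Batch layout: 120 places x 4 images per place = 480 images, 224x224 at
# training time (322x322 at inference). Each place acts as one class.
num_places, imgs_per_place = 120, 4
images = torch.randn(num_places * imgs_per_place, 3, 224, 224)
labels = torch.arange(num_places).repeat_interleave(imgs_per_place)

# Dummy global descriptors stand in for the SuperVLAD output so the
# loss call is runnable end to end.
descriptors = torch.randn(num_places * imgs_per_place, DIM,
                          requires_grad=True)
loss = criterion(descriptors, labels)
loss.backward()
optimizer.step()
scheduler.step()  # called once per epoch in practice
```

In a real run, the loss/backward/step calls would sit inside an epoch loop with early stopping on MSLS-val (halting after three epochs without improvement, per the quote above); that scaffolding is omitted for brevity.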