SuperVLAD: Compact and Robust Image Descriptors for Visual Place Recognition
Authors: Feng Lu, Xinyao Zhang, Canming Ye, Shuting Dong, Lijun Zhang, Xiangyuan Lan, Chun Yuan
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results suggest that, when paired with a transformer-based backbone, our SuperVLAD shows better domain generalization performance than NetVLAD with significantly fewer parameters. The proposed method also surpasses state-of-the-art methods with lower feature dimensions on several benchmark datasets. |
| Researcher Affiliation | Academia | Feng Lu (1,2), Xinyao Zhang (1), Canming Ye (1), Shuting Dong (1,2), Lijun Zhang (3), Xiangyuan Lan (2,4), Chun Yuan (1). (1) Tsinghua Shenzhen International Graduate School, Tsinghua University; (2) Pengcheng Laboratory; (3) CIGIT, Chinese Academy of Sciences; (4) Pazhou Laboratory (Huangpu) |
| Pseudocode | No | The paper describes its methodology using descriptive text and mathematical equations, but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/lu-feng/SuperVLAD. |
| Open Datasets | Yes | We conduct experiments on several VPR benchmark datasets, which cover various challenges in VPR such as viewpoint changes, condition changes, and perceptual aliasing. Table 1 provides a concise summary of them. Pitts30k [54] mainly exhibits severe viewpoint changes in the urban environment. MSLS [57] is a comprehensive dataset comprising images collected in urban, suburban, and natural scenes over 7 years, and encompasses a wide range of visual changes (viewpoint and condition changes). Nordland [10] exhibits seasonal changes in natural and suburban scenes. SPED [11] consists of low-quality and high-scene-depth images captured from diverse scenes with various condition changes. |
| Dataset Splits | Yes | Pitts30k contains 10k database images in each of the training, validation, and test sets. |
| Hardware Specification | Yes | We implement our experiments on two NVIDIA GeForce RTX 3090 GPUs using PyTorch. |
| Software Dependencies | No | The paper only names PyTorch ("We implement our experiments on two NVIDIA GeForce RTX 3090 GPUs using PyTorch."), without version numbers or a full dependency list. |
| Experiment Setup | Yes | The size of the input image is 224×224 in training (322×322 in inference) and the token dimension of the DINOv2-base backbone is 768. We only fine-tune the last four transformer encoder layers (freezing the previous layers) of the DINOv2 backbone. For the loss function, we utilize the multi-similarity loss [56] and set its hyperparameters as in [1]. The model training is performed using the Adam optimizer with an initial learning rate of 0.00005, halved every 3 epochs. Considering that the cross-image encoder is not initialized, we use a larger learning rate (0.0001) to train it separately. Each training batch consists of 120 places, with 4 images per place, resulting in a total of 480 images. An inference batch consists of 8 images (except for the SPED dataset, where the batch size is 4). The training process is terminated when the performance on MSLS-val does not improve for three epochs. The actual number of effective training epochs is 7, and the training time is 81.6 minutes. A hedged sketch of this setup appears below the table. |
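To make the quoted setup concrete, here is a minimal PyTorch sketch of the training configuration, not the authors' code. It assumes stand-ins throughout: the backbone is approximated by 12 generic transformer encoder layers (DINOv2-base has 12 blocks with token dimension 768), the cross-image encoder and the descriptor output are hypothetical placeholders, and "train it separately" is read here as a separate Adam parameter group with the larger learning rate. The multi-similarity loss comes from the third-party pytorch-metric-learning package; the paper's exact hyperparameters (set "as in [1]") are not reproduced, so library defaults are shown.

```python
# Hedged sketch of the quoted training setup; NOT the authors' implementation.
import torch
from torch import nn, optim
from pytorch_metric_learning import losses  # provides MultiSimilarityLoss

DIM = 768  # token dimension of DINOv2-base

# Stand-in backbone: 12 transformer blocks, as in DINOv2-base (ViT-B/14).
backbone = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=DIM, nhead=12, batch_first=True)
     for _ in range(12)]
)

# Fine-tune only the last four encoder layers; freeze the rest.
for i, block in enumerate(backbone):
    for p in block.parameters():
        p.requires_grad = i >= len(backbone) - 4

# Hypothetical cross-image encoder (randomly initialized, which is why
# the paper trains it with the larger learning rate).
cross_image_encoder = nn.TransformerEncoderLayer(
    d_model=DIM, nhead=8, batch_first=True
)

# Adam: 5e-5 for the fine-tuned backbone layers, 1e-4 for the
# cross-image encoder; both halved every 3 epochs via StepLR.
optimizer = optim.Adam([
    {"params": [p for p in backbone.parameters() if p.requires_grad],
     "lr": 5e-5},
    {"params": cross_image_encoder.parameters(), "lr": 1e-4},
])
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.5)

# Multi-similarity loss; the paper's hyperparameters follow its reference
# [1] and are not reproduced here (library defaults used instead).
criterion = losses.MultiSimilarityLoss()

# Batch layout: 120 places x 4 images per place = 480 images, 224x224 at
# training time (322x322 at inference). Each place acts as one class.
num_places, imgs_per_place = 120, 4
images = torch.randn(num_places * imgs_per_place, 3, 224, 224)
labels = torch.arange(num_places).repeat_interleave(imgs_per_place)

# Dummy global descriptors stand in for the SuperVLAD output so the
# loss call is runnable end to end.
descriptors = torch.randn(num_places * imgs_per_place, DIM,
                          requires_grad=True)
loss = criterion(descriptors, labels)
loss.backward()
optimizer.step()
scheduler.step()  # called once per epoch in practice
```

In a real run, the loss/backward/step calls would sit inside an epoch loop with early stopping on MSLS-val (halting after three epochs without improvement, per the quote above); that scaffolding is omitted for brevity.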