Cross-view Geo-localization with Layer-to-Layer Transformer

Authors: Hongji Yang, Xiufan Lu, Yingying Zhu

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that our L2LTR performs favorably against state-of-the-art methods on standard, fine-grained, and cross-dataset cross-view geo-localization tasks.
Researcher Affiliation | Academia | Hongji Yang, Xiufan Lu, Yingying Zhu; College of Computer Science and Software Engineering, Shenzhen University; {yanghongji2020, luxiufan2019}@email.szu.edu.cn, zhuyy@szu.edu.cn
Pseudocode | No | The paper includes illustrations of the model architecture (Figure 1) and mathematical formulations, but no structured pseudocode or algorithm blocks are provided.
Open Source Code | Yes | The code is available online: https://github.com/yanghongji2007/cross_view_localization_L2LTR
Open Datasets | Yes | To verify our model's effectiveness, we conduct extensive experiments on three widely used benchmarks: CVUSA [24] and CVACT [9] (including CVACT_val and CVACT_test). The CVUSA dataset provides 35,532 image pairs for training and 8,884 image pairs for testing. The CVACT dataset contains 35,532 pairs for training and 8,884 pairs for validation (denoted as CVACT_val).
Dataset Splits | Yes | The CVUSA dataset provides 35,532 image pairs for training and 8,884 image pairs for testing. The CVACT dataset contains 35,532 pairs for training and 8,884 pairs for validation (denoted as CVACT_val).
Hardware Specification | Yes | The model is trained using AdamW [10] with a cosine learning rate schedule on a 32GB NVIDIA V100 GPU.
Software Dependencies | No | The paper mentions 'AdamW [10]' as the optimizer and 'ImageNet [3]' as the source of pre-trained parameters, but does not specify version numbers for any software dependencies such as Python, PyTorch, or TensorFlow; only the optimizer name is given.
Experiment Setup | Yes | If not specified otherwise, the ground and aerial image sizes are set to 128 × 512 and 256 × 256, respectively. We empirically set model depth L to 12 and initialize our L2LTR with parameters pre-trained on ImageNet [3]. The model is trained using AdamW [10] with a cosine learning rate schedule on a 32GB NVIDIA V100 GPU. The learning rate is set to 1e-4, the weight decay to 0.03, and the batch size to 32. For the weighted soft-margin triplet loss [8], α is set to 10.
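For reference, the hyperparameters quoted above translate into roughly the following PyTorch sketch. This is not the authors' implementation: the two linear encoders stand in for the L2LTR ground and aerial branches, `num_steps` is an assumed value, and the in-batch negative handling is only illustrative; the optimizer, learning rate, weight decay, batch size, image sizes, and α = 10 are the values reported in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins for the two L2LTR branches (ground / aerial encoders).
# Any pair of networks producing same-size descriptors would slot in here.
ground_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 128 * 512, 64))
aerial_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 256 * 256, 64))

params = list(ground_net.parameters()) + list(aerial_net.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-4, weight_decay=0.03)  # values from the paper
num_steps = 10_000  # assumed; the paper only names a cosine learning rate schedule
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_steps)

batch = 32
ground = torch.randn(batch, 3, 128, 512)   # ground panoramas, 128 x 512
aerial = torch.randn(batch, 3, 256, 256)   # aerial images, 256 x 256

g = F.normalize(ground_net(ground), dim=1)
a = F.normalize(aerial_net(aerial), dim=1)
dist = torch.cdist(g, a)                   # (batch, batch) pairwise L2 distances
pos = dist.diag().unsqueeze(1)             # matched ground-aerial pairs lie on the diagonal

# Weighted soft-margin triplet loss [8]: log(1 + exp(alpha * (d_pos - d_neg))),
# with alpha = 10 as reported, averaged here over all in-batch negatives.
alpha = 10.0
neg_mask = ~torch.eye(batch, dtype=torch.bool)
loss = F.softplus(alpha * (pos - dist)).masked_select(neg_mask).mean()

loss.backward()
optimizer.step()
scheduler.step()
```

`F.softplus(x)` is used as a numerically stable form of log(1 + exp(x)); whether negatives are averaged or hard-mined is an implementation detail not fixed by the quoted setup.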