Activating Self-Attention for Multi-Scene Absolute Pose Regression

Authors: Miso Lee, Jihwan Kim, Jae-Pil Heo

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that our solution successfully recovers self-attention by preventing distortion of the query-key space and keeping the high capacity of the self-attention map [22]. As a result, our model outperforms existing MS-APR methods in both outdoor and indoor scenes without additional memory during inference, upholding the original purpose of MS-APR.
Researcher Affiliation | Academia | Miso Lee (Sungkyunkwan University, dlalth557@skku.edu); Jihwan Kim (Sungkyunkwan University, damien@skku.edu); Jae-Pil Heo (Sungkyunkwan University, jaepilheo@skku.edu)
Pseudocode | No | The paper describes the proposed methods in text and equations but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | We include the code in the supplemental materials.
Open Datasets | Yes | Datasets. We train and evaluate the model on outdoor and indoor datasets [8, 27], which include RGB images labeled with 6-DoF camera poses. Firstly, we use the Cambridge Landmarks dataset, which consists of six outdoor scenes scaled from 875 m² to 5600 m². Each scene contains 200 to 1500 training images. ... On the other hand, we use the 7Scenes dataset, which consists of seven indoor scenes scaled from 1 m² to 18 m². Each scene includes from 1000 to 7000 images. (A hedged dataset-configuration sketch follows the table.)
Dataset Splits | No | The paper describes training and evaluating on these datasets but does not explicitly specify the proportion or number of samples allocated to a validation split.
Hardware Specification | Yes | We train the model with a single RTX3090 GPU.
Software Dependencies | No | The paper mentions using the 'Adam optimizer' but does not specify version numbers for any programming languages, libraries, or other software components used in the implementation.
Experiment Setup | Yes | We train the model with a single RTX3090 GPU, Adam optimizer with β1 = 0.9, β2 = 0.999, ϵ = 10⁻¹⁰, and the batch size of 8. For the 7Scenes dataset, we train the model for 30 epochs with the initial learning rate of 1 × 10⁻⁴, reducing the learning rate by 1/10 every 10 epochs. In the case of the Cambridge Landmarks dataset, we train the model for 500 epochs with the initial learning rate of 1 × 10⁻⁴, reducing the learning rate by 1/10 every 200 epochs. ... For both the position and orientation transformer encoder-decoders, the number of layers L is 6 and the number of heads H is 8. Lastly, we set λaux for our query-key alignment loss to 0.1.
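The Experiment Setup row maps directly onto an optimizer, a step learning-rate schedule, and a handful of architecture constants. The snippet below is a minimal sketch assuming PyTorch; the helper `build_training_setup` and the way the constants are grouped are illustrative choices, while the numeric values are those quoted above.

```python
# Minimal sketch of the reported optimization setup (assumes PyTorch).
# Only the numeric hyperparameters are taken from the quoted text; the helper
# function and its interface are illustrative.
import torch

def build_training_setup(model: torch.nn.Module, dataset: str):
    """Return (optimizer, scheduler, num_epochs) matching the reported schedule."""
    optimizer = torch.optim.Adam(
        model.parameters(),
        lr=1e-4,              # initial learning rate for both datasets
        betas=(0.9, 0.999),   # β1, β2
        eps=1e-10,            # ϵ
    )
    if dataset == "7Scenes":
        # 30 epochs, learning rate divided by 10 every 10 epochs
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
        num_epochs = 30
    elif dataset == "CambridgeLandmarks":
        # 500 epochs, learning rate divided by 10 every 200 epochs
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.1)
        num_epochs = 500
    else:
        raise ValueError(f"Unknown dataset: {dataset}")
    return optimizer, scheduler, num_epochs

# Architecture and loss hyperparameters quoted in the table.
NUM_LAYERS = 6    # L, for both position and orientation transformer encoder-decoders
NUM_HEADS = 8     # H
BATCH_SIZE = 8
LAMBDA_AUX = 0.1  # weight of the query-key alignment loss
```

Under this reading, scheduler.step() is called once per epoch so the learning rate drops by a factor of 10 at the reported intervals, and the batch size of 8 applies to both datasets.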
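The Open Datasets row names Cambridge Landmarks and 7Scenes and summarizes their scale. The configuration sketch below restates those figures in Python; the per-dataset scene name lists follow the standard public releases of the two benchmarks and are an assumption, not part of the quoted text.

```python
# Hedged summary of the two public benchmarks named in the paper.
# Scene counts, spatial extents, and image counts come from the quoted text;
# the scene name lists follow the standard releases (assumption).
DATASETS = {
    "CambridgeLandmarks": {
        "setting": "outdoor",
        "num_scenes": 6,
        "scene_area_m2": (875, 5600),           # per-scene spatial extent range
        "train_images_per_scene": (200, 1500),
        "scenes": [
            "KingsCollege", "OldHospital", "ShopFacade",
            "StMarysChurch", "GreatCourt", "Street",
        ],
    },
    "7Scenes": {
        "setting": "indoor",
        "num_scenes": 7,
        "scene_area_m2": (1, 18),
        "images_per_scene": (1000, 7000),
        "scenes": [
            "chess", "fire", "heads", "office",
            "pumpkin", "redkitchen", "stairs",
        ],
    },
}
```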