SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation
Authors: Abhinav Moudgil, Arjun Majumdar, Harsh Agrawal, Stefan Lee, Dhruv Batra
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experiment with our proposed approach on the Room-to-Room (R2R) [1] and Room-Across-Room (RxR) [2] datasets. Empirically, we find that our model substantially improves VLN performance over our VLN-BERT baseline on R2R and outperforms state-of-the-art methods on English language instructions in RxR. Specifically, our proposed approach improves success weighted by path length (SPL) on the unseen validation split in R2R by 1.8 absolute percentage points. On RxR, a more challenging dataset due to indirect paths and greater variations in path length, we see even larger improvements. Success rate (SR) improves by 3.7 absolute percentage points, alongside a gain of 2.4 absolute percentage points on the normalized dynamic time warping (NDTW) metric. Through ablation experiments we find that (consistent with the observations in [3]) vision-and-language pretraining is vital to our approach, which suggests that strong visual grounding is key for using object-level features in VLN. (A sketch of the SPL metric follows the table.) |
| Researcher Affiliation | Academia | Abhinav Moudgil1, Arjun Majumdar1, Harsh Agrawal1, Stefan Lee2, Dhruv Batra1; 1 Georgia Institute of Technology, 2 Oregon State University |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | No | The paper provides a link (https://github.com/YicongHong/Recurrent-VLN-BERT) to the "released implementation" of VLN-BERT, which is a baseline model. There is no explicit statement or link providing concrete access to the source code for the SOAT model or methodology developed in this paper. |
| Open Datasets | Yes | We evaluate our method on the Room-to-Room (R2R) [1] and Room-Across-Room (RxR) [2] datasets. R2R is built using Matterport3D (MP3D) [18] indoor environments and contains 21,567 path-instruction pairs, which are divided into four splits: training (14,025), val-seen (1,020), val-unseen (2,349) and test-unseen (4,173). |
| Dataset Splits | Yes | R2R is built using Matterport3D (MP3D) [18] indoor environments and contains 21,567 path-instruction pairs, which are divided into four splits: training (14,025), val-seen (1,020), val-unseen (2,349) and test-unseen (4,173). ...English includes 26,464 path-instruction pairs for training and 4,551 pairs in the val-unseen split. |
| Hardware Specification | Yes | We implemented our model in PyTorch [31] and trained on a single Nvidia Titan X GPU. |
| Software Dependencies | No | The paper mentions 'PyTorch [31]' and other tools like 'ResNet-152 model [32]', 'Faster R-CNN [26]', and 'spaCy [33]', but does not specify their version numbers, which are required for reproducibility. |
| Experiment Setup | Yes | We train with a constant learning rate of 1e-5 using the AdamW optimizer with a batch size of 16 for 300k iterations. (A minimal sketch of this configuration follows the table.) |
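The results quoted in the Research Type row are reported in terms of success rate (SR), success weighted by path length (SPL), and normalized dynamic time warping (NDTW). For reference, the following is a minimal Python sketch of the standard SPL computation; the episode values in the usage line are illustrative placeholders, not numbers from the paper.

```python
def success_weighted_path_length(successes, shortest_lengths, path_lengths):
    """Standard SPL: average over episodes of S_i * l_i / max(p_i, l_i),
    where S_i is a 0/1 success flag, l_i the shortest-path (geodesic)
    length from start to goal, and p_i the length of the agent's path."""
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        total += s * l / max(p, l)
    return total / len(successes)


# Illustrative placeholder values (not from the paper): 3 episodes, 2 successes.
print(success_weighted_path_length([1, 0, 1], [10.0, 8.0, 12.0], [11.0, 20.0, 12.0]))
```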
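The Experiment Setup row reports a constant learning rate of 1e-5 with AdamW, a batch size of 16, and 300k iterations. Below is a minimal sketch of that optimization loop; the model, inputs, and loss are stand-ins, not the SOAT agent or the authors' training code.

```python
import torch
from torch.optim import AdamW

# Stand-in model: the actual agent is a scene- and object-aware transformer,
# which is not reproduced here.
model = torch.nn.Linear(768, 2)
optimizer = AdamW(model.parameters(), lr=1e-5)  # constant learning rate, as reported

BATCH_SIZE = 16
NUM_ITERATIONS = 300_000

for step in range(NUM_ITERATIONS):
    features = torch.randn(BATCH_SIZE, 768)       # placeholder inputs
    targets = torch.randint(0, 2, (BATCH_SIZE,))  # placeholder labels
    loss = torch.nn.functional.cross_entropy(model(features), targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```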