SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation
Authors: Abhinav Moudgil, Arjun Majumdar, Harsh Agrawal, Stefan Lee, Dhruv Batra
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experiment with our proposed approach on the Room-to-Room (R2R) [1] and Room-Across-Room (RxR) [2] datasets. Empirically, we find that our model substantially improves VLN performance over our VLN-BERT baseline on R2R and outperforms state-of-the-art methods on English language instructions in RxR. Specifically, our proposed approach improves success weighted by path length (SPL) on the unseen validation split in R2R by 1.8 absolute percentage points. On RxR, a more challenging dataset due to indirect paths and greater variations in path length, we see even larger improvements. Success rate (SR) improves by 3.7 absolute percentage points, alongside a gain of 2.4 absolute percentage points on the normalized dynamic time warping (NDTW) metric. Through ablation experiments we find that (consistent with the observations in [3]) vision-and-language pretraining is vital to our approach, which suggests that strong visual grounding is key for using object-level features in VLN. (A sketch of the SPL metric follows the table.) |
| Researcher Affiliation | Academia | Abhinav Moudgil1, Arjun Majumdar1, Harsh Agrawal1, Stefan Lee2, Dhruv Batra1; 1 Georgia Institute of Technology, 2 Oregon State University |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | No | The paper provides a link (https://github.com/YicongHong/Recurrent-VLN-BERT) to the "released implementation" of VLN-BERT, which is a baseline model. There is no explicit statement or link providing concrete access to the source code for the SOAT model or methodology developed in this paper. |
| Open Datasets | Yes | We evaluate our method on the Room-to-Room (R2R) [1] and Room-Across-Room (RxR) [2] datasets. R2R is built using Matterport3D (MP3D) [18] indoor environments and contains 21,567 path-instruction pairs, which are divided into four splits: training (14,025), val-seen (1,020), val-unseen (2,349) and test-unseen (4,173). |
| Dataset Splits | Yes | R2R is built using Matterport3D (MP3D) [18] indoor environments and contains 21,567 path-instruction pairs, which are divided into four splits: training (14,025), val-seen (1,020), val-unseen (2,349) and test-unseen (4,173). ...English includes 26,464 path-instruction pairs for training and 4,551 pairs in the val-unseen split. |
| Hardware Specification | Yes | We implemented our model in PyTorch [31] and trained on a single Nvidia Titan X GPU. |
| Software Dependencies | No | The paper mentions 'PyTorch [31]' and other tools like 'ResNet-152 model [32]', 'Faster R-CNN [26]', and 'spaCy [33]', but does not specify their version numbers, which are required for reproducibility. |
| Experiment Setup | Yes | We train with a constant learning rate of 1e-5 using the AdamW optimizer with a batch size of 16 for 300k iterations. (A minimal sketch of this configuration follows the table.) |
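The results quoted in the Research Type row are reported in terms of success rate (SR), success weighted by path length (SPL), and normalized dynamic time warping (NDTW). For reference, the following is a minimal Python sketch of the standard SPL computation; the episode values in the usage line are illustrative placeholders, not numbers from the paper.

```python
def success_weighted_path_length(successes, shortest_lengths, path_lengths):
    """Standard SPL: average over episodes of S_i * l_i / max(p_i, l_i),
    where S_i is a 0/1 success flag, l_i the shortest-path (geodesic)
    length from start to goal, and p_i the length of the agent's path."""
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        total += s * l / max(p, l)
    return total / len(successes)


# Illustrative placeholder values (not from the paper): 3 episodes, 2 successes.
print(success_weighted_path_length([1, 0, 1], [10.0, 8.0, 12.0], [11.0, 20.0, 12.0]))
```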
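The Experiment Setup row reports a constant learning rate of 1e-5 with AdamW, a batch size of 16, and 300k iterations. Below is a minimal sketch of that optimization loop; the model, inputs, and loss are stand-ins, not the SOAT agent or the authors' training code.

```python
import torch
from torch.optim import AdamW

# Stand-in model: the actual agent is a scene- and object-aware transformer,
# which is not reproduced here.
model = torch.nn.Linear(768, 2)
optimizer = AdamW(model.parameters(), lr=1e-5)  # constant learning rate, as reported

BATCH_SIZE = 16
NUM_ITERATIONS = 300_000

for step in range(NUM_ITERATIONS):
    features = torch.randn(BATCH_SIZE, 768)       # placeholder inputs
    targets = torch.randint(0, 2, (BATCH_SIZE,))  # placeholder labels
    loss = torch.nn.functional.cross_entropy(model(features), targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```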