History Aware Multimodal Transformer for Vision-and-Language Navigation
Authors: Shizhe Chen, Pierre-Louis Guhur, Cordelia Schmid, Ivan Laptev
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | HAMT achieves new state of the art on a broad range of VLN tasks, including VLN with fine-grained instructions (R2R, RxR), high-level instructions (R2R-Last, REVERIE), dialogs (CVDN) as well as long-horizon VLN (R4R, R2R-Back). We demonstrate HAMT to be particularly effective for navigation tasks with longer trajectories. We carry out extensive experiments on various VLN tasks, including VLN with fine-grained instructions (R2R [6] and RxR [7]), high-level instructions (REVERIE [8] and our proposed R2R-Last), dialogs [9] as well as long-horizon VLN (R4R [3] and our proposed R2R-Back which requires the agent to return back after arriving at the target location). |
| Researcher Affiliation | Academia | Inria, École normale supérieure, CNRS, PSL Research University |
| Pseudocode | No | The paper includes architectural diagrams (Figure 1 and Figure 2) but no sections or figures explicitly labeled 'Pseudocode' or 'Algorithm', nor any structured steps formatted like code. |
| Open Source Code | No | The paper provides the URL 'https://cshizhe.github.io/projects/vln_hamt.html' which is a project demonstration page. While this page may link to source code, the paper itself does not provide a direct link to a specific code repository for the methodology. |
| Open Datasets | Yes | Datasets. We evaluate our method on four VLN tasks (seven datasets): VLN with fine-grained instructions (R2R [6], RxR [7]); VLN with high-level instructions (REVERIE [8], R2R-Last); vision-and-dialogue navigation (CVDN [9]); and long-horizon VLN (R4R [3], R2R-Back). |
| Dataset Splits | Yes | The dataset is split into train, val seen, val unseen and test unseen sets with 61, 56, 11 and 18 houses respectively. Houses in val seen split are the same as training, while houses in val unseen and test splits are different from training. |
| Hardware Specification | Yes | We train HAMT for 200k iterations with fixed ViT using learning rate of 5e-5 and batch size of 64 on 4 NVIDIA Tesla P100 GPUs (~1 day). The whole HAMT model is trained end-to-end for 20k iterations on 20 NVIDIA V100 GPUs with learning rate of 5e-5 for ViT and 1e-5 for the others (~20 hours). |
| Software Dependencies | No | The paper mentions models and algorithms like 'ViT-B/16', 'BERT', and 'A3C RL algorithm' but does not specify any software dependencies (e.g., Python, PyTorch, TensorFlow) with version numbers. |
| Experiment Setup | Yes | In training with proxy tasks, we randomly select proxy tasks for each mini-batch with predefined ratio. We train HAMT for 200k iterations with fixed ViT using learning rate of 5e-5 and batch size of 64 on 4 NVIDIA Tesla P100 GPUs (~1 day). The whole HAMT model is trained end-to-end for 20k iterations on 20 NVIDIA V100 GPUs with learning rate of 5e-5 for ViT and 1e-5 for the others (~20 hours). We use R2R training set and augmented pairs from [22] for training unless otherwise noted. In fine-tuning with RL+IL, we set λ = 0.2 in Eq (3) and γ = 0.9. The model is fine-tuned for 100k iterations with learning rate of 1e-5 and batch size of 8 on a single GPU. |
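
To make the quoted training recipe concrete, the sketch below shows one way the mixed RL+IL objective (λ = 0.2, γ = 0.9) and the two learning-rate groups (5e-5 for ViT, 1e-5 for the rest) could be wired up in PyTorch. This is not the authors' released code: the optimizer choice (AdamW), the `vit.` parameter-name prefix, and all function names are assumptions for illustration only.

```python
# Hedged sketch of the HAMT training schedule quoted above; hyperparameters follow the
# paper, but structure, names, and the AdamW optimizer are assumptions.
import torch

LAMBDA = 0.2   # IL weight in the mixed objective, Eq. (3) of the paper
GAMMA = 0.9    # discount factor used for the RL return


def mixed_loss(rl_loss: torch.Tensor, il_loss: torch.Tensor) -> torch.Tensor:
    """Combine the RL and IL objectives: L = L_RL + lambda * L_IL."""
    return rl_loss + LAMBDA * il_loss


def discounted_returns(rewards):
    """Discounted returns G_t = r_t + gamma * G_{t+1} over one navigation episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + GAMMA * g
        returns.append(g)
    return list(reversed(returns))


def build_optimizer(model: torch.nn.Module, end_to_end: bool):
    """End-to-end pre-training uses two learning-rate groups (5e-5 for the ViT,
    1e-5 for everything else); RL+IL fine-tuning uses a single 1e-5 group."""
    if end_to_end:
        vit_params = [p for n, p in model.named_parameters() if n.startswith("vit.")]
        other_params = [p for n, p in model.named_parameters() if not n.startswith("vit.")]
        return torch.optim.AdamW([
            {"params": vit_params, "lr": 5e-5},
            {"params": other_params, "lr": 1e-5},
        ])
    return torch.optim.AdamW(model.parameters(), lr=1e-5)
```

The split into a fixed-ViT stage, an end-to-end stage, and a single-GPU RL+IL fine-tuning stage mirrors the three phases described in the Experiment Setup row; batch sizes (64 / 8) and iteration counts (200k / 20k / 100k) would be set in the surrounding training loop.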