History Aware Multimodal Transformer for Vision-and-Language Navigation
Authors: Shizhe Chen, Pierre-Louis Guhur, Cordelia Schmid, Ivan Laptev
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | HAMT achieves new state of the art on a broad range of VLN tasks, including VLN with fine-grained instructions (R2R, RxR), high-level instructions (R2R-Last, REVERIE), dialogs (CVDN) as well as long-horizon VLN (R4R, R2R-Back). We demonstrate HAMT to be particularly effective for navigation tasks with longer trajectories. We carry out extensive experiments on various VLN tasks, including VLN with fine-grained instructions (R2R [6] and RxR [7]), high-level instructions (REVERIE [8] and our proposed R2R-Last), dialogs [9] as well as long-horizon VLN (R4R [3] and our proposed R2R-Back which requires the agent to return back after arriving at the target location). |
| Researcher Affiliation | Academia | Inria, École normale supérieure, CNRS, PSL Research University |
| Pseudocode | No | The paper includes architectural diagrams (Figure 1 and Figure 2) but no sections or figures explicitly labeled 'Pseudocode' or 'Algorithm', nor any structured steps formatted like code. |
| Open Source Code | No | The paper provides the URL 'https://cshizhe.github.io/projects/vln_hamt.html' which is a project demonstration page. While this page may link to source code, the paper itself does not provide a direct link to a specific code repository for the methodology. |
| Open Datasets | Yes | Datasets. We evaluate our method on four VLN tasks (seven datasets): VLN with fine-grained instructions (R2R [6], RxR [7]); VLN with high-level instructions (REVERIE [8], R2R-Last); vision-and-dialogue navigation (CVDN [9]); and long-horizon VLN (R4R [3], R2R-Back). |
| Dataset Splits | Yes | The dataset is split into train, val seen, val unseen and test unseen sets with 61, 56, 11 and 18 houses respectively. Houses in val seen split are the same as training, while houses in val unseen and test splits are different from training. |
| Hardware Specification | Yes | We train HAMT for 200k iterations with fixed ViT using learning rate of 5e-5 and batch size of 64 on 4 NVIDIA Tesla P100 GPUs (~1 day). The whole HAMT model is trained end-to-end for 20k iterations on 20 NVIDIA V100 GPUs with learning rate of 5e-5 for ViT and 1e-5 for the others (~20 hours). |
| Software Dependencies | No | The paper mentions models and algorithms like 'ViT-B/16', 'BERT', and 'A3C RL algorithm' but does not specify any software dependencies (e.g., Python, PyTorch, TensorFlow) with version numbers. |
| Experiment Setup | Yes | In training with proxy tasks, we randomly select proxy tasks for each mini-batch with predefined ratio. We train HAMT for 200k iterations with fixed ViT using learning rate of 5e-5 and batch size of 64 on 4 NVIDIA Tesla P100 GPUs (~1 day). The whole HAMT model is trained end-to-end for 20k iterations on 20 NVIDIA V100 GPUs with learning rate of 5e-5 for ViT and 1e-5 for the others (~20 hours). We use R2R training set and augmented pairs from [22] for training unless otherwise noted. In fine-tuning with RL+IL, we set λ = 0.2 in Eq (3) and γ = 0.9. The model is fine-tuned for 100k iterations with learning rate of 1e-5 and batch size of 8 on a single GPU. |
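
To make the quoted training recipe concrete, the sketch below shows one way the mixed RL+IL objective (λ = 0.2, γ = 0.9) and the two learning-rate groups (5e-5 for ViT, 1e-5 for the rest) could be wired up in PyTorch. This is not the authors' released code: the optimizer choice (AdamW), the `vit.` parameter-name prefix, and all function names are assumptions for illustration only.

```python
# Hedged sketch of the HAMT training schedule quoted above; hyperparameters follow the
# paper, but structure, names, and the AdamW optimizer are assumptions.
import torch

LAMBDA = 0.2   # IL weight in the mixed objective, Eq. (3) of the paper
GAMMA = 0.9    # discount factor used for the RL return


def mixed_loss(rl_loss: torch.Tensor, il_loss: torch.Tensor) -> torch.Tensor:
    """Combine the RL and IL objectives: L = L_RL + lambda * L_IL."""
    return rl_loss + LAMBDA * il_loss


def discounted_returns(rewards):
    """Discounted returns G_t = r_t + gamma * G_{t+1} over one navigation episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + GAMMA * g
        returns.append(g)
    return list(reversed(returns))


def build_optimizer(model: torch.nn.Module, end_to_end: bool):
    """End-to-end pre-training uses two learning-rate groups (5e-5 for the ViT,
    1e-5 for everything else); RL+IL fine-tuning uses a single 1e-5 group."""
    if end_to_end:
        vit_params = [p for n, p in model.named_parameters() if n.startswith("vit.")]
        other_params = [p for n, p in model.named_parameters() if not n.startswith("vit.")]
        return torch.optim.AdamW([
            {"params": vit_params, "lr": 5e-5},
            {"params": other_params, "lr": 1e-5},
        ])
    return torch.optim.AdamW(model.parameters(), lr=1e-5)
```

The split into a fixed-ViT stage, an end-to-end stage, and a single-GPU RL+IL fine-tuning stage mirrors the three phases described in the Experiment Setup row; batch sizes (64 / 8) and iteration counts (200k / 20k / 100k) would be set in the surrounding training loop.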