Speaker-Follower Models for Vision-and-Language Navigation

Authors: Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, Trevor Darrell

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that all three components of this approach (speaker-driven data augmentation, pragmatic reasoning, and panoramic action space) dramatically improve the performance of a baseline instruction follower, more than doubling the success rate over the best existing approach on a standard benchmark.
Researcher Affiliation | Academia | University of California, Berkeley; Carnegie Mellon University; Boston University
Pseudocode | Yes | See Sec. B in the supplementary material for pseudocode.
Open Source Code | Yes | Our code and data are available at http://ronghanghu.com/speaker_follower.
Open Datasets | Yes | We use the Room-to-Room (R2R) vision-and-language navigation dataset [1] for our experimental evaluation.
Dataset Splits | Yes | The dataset is split into training, validation, and test sets. The validation set is split into two parts: seen, where routes are sampled from environments seen during training, and unseen, with environments that are not seen during training. All test-set routes belong to new environments unseen in the training and validation sets. (A loading sketch follows the table.)
Hardware Specification | No | The paper does not specify the exact hardware (e.g., GPU or CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using ResNet [21] and GloVe embeddings [38] but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | The follower model is trained using student-forcing on this augmented data for 50,000 iterations, and then fine-tuned on the original human-produced data for 20,000 iterations. For all experiments using pragmatic inference, we use a speaker weight of λ = 0.95, which we found to produce the best results on both the seen and unseen validation environments. (A rescoring sketch follows the table.)
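
The Dataset Splits row above describes the standard R2R partition into train, val_seen, val_unseen, and test. Below is a minimal loading sketch in Python; it assumes the split files follow the usual R2R naming convention (`R2R_train.json`, `R2R_val_seen.json`, `R2R_val_unseen.json`, `R2R_test.json`), and the `tasks/R2R/data` directory is a hypothetical path to adjust for your checkout.

```python
import json
from pathlib import Path

# Hypothetical data directory; point this at wherever the R2R JSON files live.
DATA_DIR = Path("tasks/R2R/data")

# One JSON file per split, following the assumed R2R naming convention.
SPLITS = ["train", "val_seen", "val_unseen", "test"]


def load_split(split: str) -> list:
    """Return the list of instruction/route entries for one R2R split."""
    with open(DATA_DIR / f"R2R_{split}.json") as f:
        return json.load(f)


if __name__ == "__main__":
    for split in SPLITS:
        entries = load_split(split)
        print(f"{split}: {len(entries)} entries")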
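
The Experiment Setup row gives the speaker weight λ = 0.95 used for pragmatic inference. The sketch below assumes the standard rational-speaker rescoring, in which each candidate route proposed by the follower is ranked by λ·log P_S(instruction | route) + (1 − λ)·log P_F(route | instruction); the function names and scoring callables are illustrative, not the released implementation.

```python
SPEAKER_WEIGHT = 0.95  # lambda reported in the paper's experiment setup


def pick_route(candidates, speaker_logprob, follower_logprob,
               speaker_weight=SPEAKER_WEIGHT):
    """Select the candidate route with the best weighted combination of
    speaker and follower log-probabilities.

    candidates:       iterable of candidate routes for one instruction
    speaker_logprob:  callable, route -> log P_S(instruction | route)
    follower_logprob: callable, route -> log P_F(route | instruction)
    """
    def combined_score(route):
        return (speaker_weight * speaker_logprob(route)
                + (1.0 - speaker_weight) * follower_logprob(route))

    return max(candidates, key=combined_score)
```

With the weight close to 1, the choice is dominated by how well the speaker can reproduce the given instruction from each candidate route, which matches the setting (0.95) the paper reports as best on both the seen and unseen validation environments.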