Speaker-Follower Models for Vision-and-Language Navigation

Authors: Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, Trevor Darrell

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that all three components of this approach (speaker-driven data augmentation, pragmatic reasoning, and panoramic action space) dramatically improve the performance of a baseline instruction follower, more than doubling the success rate over the best existing approach on a standard benchmark.
Researcher Affiliation | Academia | University of California, Berkeley; Carnegie Mellon University; Boston University
Pseudocode | Yes | See Sec. B in the supplementary material for pseudocode.
Open Source Code | Yes | Our code and data are available at http://ronghanghu.com/speaker_follower.
Open Datasets | Yes | We use the Room-to-Room (R2R) vision-and-language navigation dataset [1] for our experimental evaluation.
Dataset Splits | Yes | The dataset is split into training, validation, and test sets. The validation set is split into two parts: seen, where routes are sampled from environments seen during training, and unseen, with environments that are not seen during training. All test-set routes belong to new environments unseen in the training and validation sets. (A loading sketch follows the table.)
Hardware Specification | No | The paper does not specify the exact hardware (e.g., GPU or CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using ResNet [21] and GloVe embeddings [38] but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | The follower model is trained using student-forcing on this augmented data for 50,000 iterations, and then fine-tuned on the original human-produced data for 20,000 iterations. For all experiments using pragmatic inference, we use a speaker weight of λ = 0.95, which we found to produce the best results on both the seen and unseen validation environments. (A rescoring sketch follows the table.)
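
The Dataset Splits row above describes the standard R2R partition into train, val_seen, val_unseen, and test. Below is a minimal loading sketch in Python; it assumes the split files follow the usual R2R naming convention (`R2R_train.json`, `R2R_val_seen.json`, `R2R_val_unseen.json`, `R2R_test.json`), and the `tasks/R2R/data` directory is a hypothetical path to adjust for your checkout.

```python
import json
from pathlib import Path

# Hypothetical data directory; point this at wherever the R2R JSON files live.
DATA_DIR = Path("tasks/R2R/data")

# One JSON file per split, following the assumed R2R naming convention.
SPLITS = ["train", "val_seen", "val_unseen", "test"]


def load_split(split: str) -> list:
    """Return the list of instruction/route entries for one R2R split."""
    with open(DATA_DIR / f"R2R_{split}.json") as f:
        return json.load(f)


if __name__ == "__main__":
    for split in SPLITS:
        entries = load_split(split)
        print(f"{split}: {len(entries)} entries")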
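
The Experiment Setup row gives the speaker weight λ = 0.95 used for pragmatic inference. The sketch below assumes the standard rational-speaker rescoring, in which each candidate route proposed by the follower is ranked by λ·log P_S(instruction | route) + (1 − λ)·log P_F(route | instruction); the function names and scoring callables are illustrative, not the released implementation.

```python
SPEAKER_WEIGHT = 0.95  # lambda reported in the paper's experiment setup


def pick_route(candidates, speaker_logprob, follower_logprob,
               speaker_weight=SPEAKER_WEIGHT):
    """Select the candidate route with the best weighted combination of
    speaker and follower log-probabilities.

    candidates:       iterable of candidate routes for one instruction
    speaker_logprob:  callable, route -> log P_S(instruction | route)
    follower_logprob: callable, route -> log P_F(route | instruction)
    """
    def combined_score(route):
        return (speaker_weight * speaker_logprob(route)
                + (1.0 - speaker_weight) * follower_logprob(route))

    return max(candidates, key=combined_score)
```

With the weight close to 1, the choice is dominated by how well the speaker can reproduce the given instruction from each candidate route, which matches the setting (0.95) the paper reports as best on both the seen and unseen validation environments.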