Speaker-Follower Models for Vision-and-Language Navigation
Authors: Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, Trevor Darrell
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that all three components of this approach (speaker-driven data augmentation, pragmatic reasoning, and panoramic action space) dramatically improve the performance of a baseline instruction follower, more than doubling the success rate over the best existing approach on a standard benchmark. |
| Researcher Affiliation | Academia | University of California, Berkeley; Carnegie Mellon University; Boston University |
| Pseudocode | Yes | See Sec. B in the supplementary material for pseudocode. |
| Open Source Code | Yes | Our code and data are available at http://ronghanghu.com/speaker_follower. |
| Open Datasets | Yes | We use the Room-to-Room (R2R) vision-and-language navigation dataset [1] for our experimental evaluation. |
| Dataset Splits | Yes | The dataset is split into training, validation, and test sets. The validation set is split into two parts: seen, where routes are sampled from environments seen during training, and unseen, with environments that are not seen during training. All the test set routes belong to new environments unseen in the training and validation sets. (A split-loading sketch follows the table.) |
| Hardware Specification | No | The paper does not specify the exact hardware (e.g., GPU or CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using ResNet [21] and GloVe embeddings [38] but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | The follower model is trained using student-forcing on this augmented data for 50,000 iterations, and then fine-tuned on the original human-produced data for 20,000 iterations. For all experiments using pragmatic inference, we use a speaker weight of λ = 0.95, which we found to produce the best results on both the seen and unseen validation environments. (A rescoring sketch follows the table.) |
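
The Dataset Splits row above refers to the four R2R splits (train, val_seen, val_unseen, test). Below is a minimal loading sketch, assuming the JSON file naming of the public R2R release (e.g. R2R_val_unseen.json) and a hypothetical data directory; it is an illustration, not the authors' released loader.

```python
import json

# The four R2R splits described in the Dataset Splits row above.
SPLITS = ["train", "val_seen", "val_unseen", "test"]

def load_r2r_split(data_dir, split):
    """Load one R2R split; each entry pairs a route through an
    environment with its natural-language instructions."""
    # File naming assumed from the public R2R release (e.g. R2R_val_unseen.json).
    with open(f"{data_dir}/R2R_{split}.json") as f:
        return json.load(f)

if __name__ == "__main__":
    # "tasks/R2R/data" is a hypothetical location; point it at wherever the JSON files live.
    data = {split: load_r2r_split("tasks/R2R/data", split) for split in SPLITS}
    for split, items in data.items():
        print(split, len(items))
```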
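
The Experiment Setup row mentions pragmatic inference with a speaker weight of λ = 0.95. A minimal sketch of that rescoring step is below: candidate routes are ranked by a log-linear combination of speaker and follower scores. The `speaker_log_prob` and `follower_log_prob` callables are assumed interfaces for illustration, not the released API.

```python
def pragmatic_rerank(instruction, candidate_routes,
                     speaker_log_prob, follower_log_prob,
                     speaker_weight=0.95):
    """Pick the candidate route with the highest log-linear combination of
    speaker and follower scores, i.e.
    speaker_weight * log P_S(instruction | route)
      + (1 - speaker_weight) * log P_F(route | instruction)."""
    def score(route):
        return (speaker_weight * speaker_log_prob(instruction, route)
                + (1.0 - speaker_weight) * follower_log_prob(route, instruction))

    # Ties are broken arbitrarily by max().
    return max(candidate_routes, key=score)


# Usage sketch with toy scoring functions standing in for real model scores.
if __name__ == "__main__":
    routes = ["route_a", "route_b"]
    toy_speaker = lambda instr, route: -1.0 if route == "route_a" else -2.0
    toy_follower = lambda route, instr: -3.0 if route == "route_a" else -0.5
    print(pragmatic_rerank("walk past the couch and stop", routes,
                           toy_speaker, toy_follower))  # "route_a" wins at lambda = 0.95
```

With a speaker weight this close to 1, the speaker's score of how well a candidate route explains the instruction dominates the ranking, which is the design choice the paper reports as working best on both validation splits.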