Counterfactual Vision-and-Language Navigation: Unravelling the Unseen
Authors: Amin Parvaneh, Ehsan Abbasnejad, Damien Teney, Javen Qinfeng Shi, Anton van den Hengel
NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that this technique provides significant improvements in generalisation on benchmarks for Room-to-Room (R2R) navigation [8] and Embodied Question Answering [9]. |
| Researcher Affiliation | Academia | Australian Institute for Machine Learning, University of Adelaide, Australia. {amin.parvaneh, ehsan.abbasnejad, damien.teney, javen.shi, anton.vandenhengel}@adelaide.edu.au |
| Pseudocode | Yes | Algorithm 1: Training of a VLN agent through IL and RL, with factual data (original training set) and counterfactual observations (generated instances). A hedged sketch of this training scheme is given after the table. |
| Open Source Code | No | The paper does not provide a direct link or explicit statement about the availability of its source code. |
| Open Datasets | Yes | Room-to-Room (R2R) [8] is a dataset of natural language instructions for indoor navigation collected using Amazon Mechanical Turk (AMT) and employing a simulator based on Matterport3D environments [39]. The training is based on 14,025 instruction-visual path pairs in 61 environments. Embodied Question Answering (EQA) [9] is a challenging variant of Vision and Language Navigation. The dataset consists of 6,912 route-question-answer tuples in 645 distinct training environments and a collection of 898 tuples in 57 unseen environments for the test set. |
| Dataset Splits | Yes | The training is based on 14,025 instruction-visual path pairs in 61 environments. The validation is done in two settings: (1) seen, where the environment is from the training set but the instructions are not, and (2) unseen, where both the instructions and the visual observations are never seen by the agent. The dataset consists of 6,912 route-question-answer tuples in 645 distinct training environments and a collection of 898 tuples in 57 unseen environments for the test set. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for its experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies, libraries, or solvers used (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We set the prior p(u) to Beta(0.75, 0.75), and use 5 interactions to optimise Eq. (11) with the learning rate set to 0.1. Using grid search, we concluded γ = 0.1 provides best results. We closely follow the experiment setup of [11] where the visual observations consists of the features extracted using the pretrained Res Net-152 [40] from the egocentric panoramic view of the agent. We optimise our models using RMSprop with a learning rate of 1 10 4 and batch size of 64 for 80, 000 iterations in all of our experiments, except when indicated. We set α 0.83 (i.e. α (1 α) = 5) by grid search in behavioural cloning setting (without counterfactual learning) for all the experiments. We train all of the models for 30 epochs (more than 10, 000 iterations) in a behavioural cloning setting with a batch size of 20 and learning rate set to 1 10 3 using Adam optimiser. |