Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Counterfactual Vision-and-Language Navigation: Unravelling the Unseen
Authors: Amin Parvaneh, Ehsan Abbasnejad, Damien Teney, Javen Qinfeng Shi, Anton van den Hengel
NeurIPS 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that this technique provides significant improvements in generalisation on benchmarks for Room-to-Room (R2R) navigation [8] and Embodied Question Answering [9]. |
| Researcher Affiliation | Academia | Australian Institute for Machine Learning, University of Adelaide, Australia |
| Pseudocode | Yes | Algorithm 1: Training of a VLN agent through IL and RL, with factual data (original training set) and counterfactual observations (generated instances). |
| Open Source Code | No | The paper does not provide a direct link or explicit statement about the availability of its source code. |
| Open Datasets | Yes | Room-to-Room (R2R) [8] is a dataset of natural language instructions for indoor navigation collected using Amazon Mechanical Turk (AMT) and employing a simulator based on Matterport3D environments [39]. The training is based on 14,025 instruction–visual-path pairs in 61 environments. Embodied Question Answering (EQA) [9] is a challenging variant of Vision and Language Navigation. The dataset consists of 6,912 route-question-answer tuples in 645 distinct training environments and a collection of 898 tuples in 57 unseen environments for the test set. |
| Dataset Splits | Yes | The training is based on 14,025 instruction–visual-path pairs in 61 environments. The validation is done in two settings: (1) seen, where the environment is from the training set but the instructions are not, and (2) unseen, where both the instructions and the visual observations are never seen by the agent. The dataset consists of 6,912 route-question-answer tuples in 645 distinct training environments and a collection of 898 tuples in 57 unseen environments for the test set. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for its experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies, libraries, or solvers used (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We set the prior p(u) to Beta(0.75, 0.75), and use 5 iterations to optimise Eq. (11) with the learning rate set to 0.1. Using grid search, we concluded γ = 0.1 provides the best results. We closely follow the experiment setup of [11], where the visual observations consist of features extracted using the pretrained ResNet-152 [40] from the egocentric panoramic view of the agent. We optimise our models using RMSprop with a learning rate of 1×10⁻⁴ and a batch size of 64 for 80,000 iterations in all of our experiments, except when indicated. We set α = 0.83 (i.e. α/(1−α) = 5) by grid search in the behavioural cloning setting (without counterfactual learning) for all the experiments. We train all of the models for 30 epochs (more than 10,000 iterations) in a behavioural cloning setting with a batch size of 20 and the learning rate set to 1×10⁻³ using the Adam optimiser. |
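The hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. This is a minimal illustration for reference only, not the authors' code; the dictionary keys are assumed names, and only the values are taken from the paper:

```python
# Hedged sketch of the reported training configuration. Values are quoted
# from the paper; key names are illustrative, not from the authors' code.
main_config = {
    "optimizer": "RMSprop",
    "learning_rate": 1e-4,      # 1x10^-4
    "batch_size": 64,
    "iterations": 80_000,
    "prior": ("Beta", 0.75, 0.75),
    "gamma": 0.1,               # chosen by grid search
}

behavioural_cloning_config = {
    "optimizer": "Adam",
    "learning_rate": 1e-3,      # 1x10^-3
    "batch_size": 20,
    "epochs": 30,               # more than 10,000 iterations
}

# The reported alpha = 0.83 satisfies alpha / (1 - alpha) = 5,
# i.e. alpha = 5/6 exactly.
alpha = 5 / 6
print(round(alpha, 2))  # → 0.83
```

The α value is the one internal consistency check worth noting: α/(1 − α) = 5 pins α to exactly 5/6 ≈ 0.83, matching the grid-search result quoted above.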