Diagnosing the Environment Bias in Vision-and-Language Navigation

Authors: Yubo Zhang, Hao Tan, Mohit Bansal

IJCAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility assessment. Each entry below lists the reproducibility variable, the result, and the supporting LLM response.

Variable: Research Type
Result: Experimental
LLM Response: In this work, we design novel diagnosis experiments via environment re-splitting and feature replacement, looking into possible reasons for this environment bias. We observe that neither the language nor the underlying navigational graph, but the low-level visual appearance conveyed by ResNet features directly affects the agent model and contributes to this environment bias in results.

Variable: Researcher Affiliation
Result: Academia
LLM Response: Yubo Zhang, Hao Tan and Mohit Bansal, UNC Chapel Hill, {zhangyb, haotan, mbansal}@cs.unc.edu

Variable: Pseudocode
Result: No
LLM Response: The paper does not include any blocks labeled as "Pseudocode" or "Algorithm". It presents mathematical formulas for feature calculation and MLP training, but these are not structured as pseudocode.

Variable: Open Source Code
Result: Yes
LLM Response: Code, features at https://github.com/zhangybzbo/EnvBiasVLN.

Variable: Open Datasets
Result: Yes
LLM Response: Vision-and-Language Navigation: Several datasets have been released recently, such as Room-to-Room [Anderson et al., 2018b], Room-for-Room [Jain et al., 2019], Touchdown [Chen et al., 2019b], CVDN [Thomason et al., 2019b] and EQA [Das et al., 2018]. ... indoor VLN datasets (e.g., those collected from Matterport3D [Chang et al., 2017]) use disjoint sets of environments in training and testing.

Variable: Dataset Splits
Result: Yes
LLM Response: Two validation splits are provided as well: validation seen (which takes the data from training environments) and validation unseen (whose data is taken from testing environments different from the training). ... we re-split the environment and categorize the validation data into three sets based on their visibility to the training set: path-seen, path-unseen, and env-unseen. ... As shown in Table 2, the agent performs better on val path-seen than val path-unseen, which suggests that a path-level locality exists in current VLN agent models. (A sketch of this split categorization appears below, after the last entry.)

Variable: Hardware Specification
Result: No
LLM Response: The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. It only discusses the general setup without mentioning any concrete hardware.

Variable: Software Dependencies
Result: No
LLM Response: The paper mentions several software components and models like "ResNet features", "Faster R-CNN", and "multi-layer perceptron (MLP)", but it does not specify version numbers for any programming languages, libraries, or frameworks (e.g., Python version, PyTorch/TensorFlow version, specific ResNet model version beyond 152).

Variable: Experiment Setup
Result: Yes
LLM Response: We use a multi-layer perceptron (MLP) to generate the Learned-Seg semantic features. The multi-layer perceptron includes three fully-connected layers with ReLU activation on the outputs of the first two layers. The input of this MLP is the 2048-dim ResNet feature f of each image view. The hidden sizes of the first two layers are 512 and 256. The final layer outputs the 42-dim semantic feature y that represents the areas of each semantic class. After the linear layers, we use the sigmoid function σ to convert the output to the ratio of areas. ... The model is trained with ground-truth semantic areas (normalized to [0, 1]) of the views in 51 environments out of the total 61 VLN training environments, and is tuned on the remaining 10 environments. We minimize the binary cross-entropy loss between the ground-truth areas {ŷ_i} and the predicted areas {y_i}, where i indexes the semantic classes. Dropout layers with a probability of 0.3 are added between fully-connected layers during training. (A PyTorch sketch of this MLP follows below.)

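The environment re-splitting quoted in the Dataset Splits entry lends itself to a short illustration. The following is a minimal sketch, not taken from the paper's released code, of how a validation item could be bucketed into path-seen, path-unseen, or env-unseen given the set of training environments and training paths; the field names (env_id, path_id) and the helper function are assumptions introduced for illustration.

```python
# Hypothetical sketch of the three-way re-split categorization described above.
# Field names and the helper are assumptions, not the paper's actual code.

def categorize_val_item(item, train_env_ids, train_path_ids):
    """Return 'path-seen', 'path-unseen', or 'env-unseen' for one validation item."""
    if item["env_id"] not in train_env_ids:
        return "env-unseen"      # environment never appears in training
    if item["path_id"] in train_path_ids:
        return "path-seen"       # training environment and a path seen in training
    return "path-unseen"         # training environment, but a held-out path

# Toy usage:
train_envs = {"envA", "envB"}
train_paths = {"p1", "p2"}
val_items = [
    {"env_id": "envA", "path_id": "p1"},   # -> path-seen
    {"env_id": "envB", "path_id": "p9"},   # -> path-unseen
    {"env_id": "envC", "path_id": "p7"},   # -> env-unseen
]
for it in val_items:
    print(it, categorize_val_item(it, train_envs, train_paths))
```
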
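The Experiment Setup entry describes the Learned-Seg MLP precisely enough to sketch it. Below is a minimal PyTorch sketch that follows the quoted architecture (2048-dim ResNet feature input, hidden sizes 512 and 256, 42 output classes, sigmoid outputs, dropout 0.3 between fully-connected layers, binary cross-entropy loss); the optimizer, learning rate, and batch size are assumptions, since the quoted setup does not state them.

```python
# Minimal sketch of the Learned-Seg MLP under the assumptions stated above.
import torch
import torch.nn as nn

class LearnedSegMLP(nn.Module):
    def __init__(self, in_dim=2048, hidden1=512, hidden2=256, num_classes=42, p_drop=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden1),
            nn.ReLU(),
            nn.Dropout(p_drop),     # dropout between fully-connected layers (active in train mode)
            nn.Linear(hidden1, hidden2),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden2, num_classes),
            nn.Sigmoid(),           # convert outputs to area ratios in [0, 1]
        )

    def forward(self, resnet_feat):
        return self.net(resnet_feat)

# Toy training step: ResNet features f and normalized ground-truth areas y_gt.
model = LearnedSegMLP()
criterion = nn.BCELoss()                                      # binary cross-entropy on area ratios
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)     # assumed optimizer and learning rate

f = torch.randn(8, 2048)     # batch of 8 image-view features (assumed batch size)
y_gt = torch.rand(8, 42)     # ground-truth semantic areas in [0, 1]
loss = criterion(model(f), y_gt)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Using an explicit nn.Sigmoid with nn.BCELoss mirrors the quoted description; in practice, nn.BCEWithLogitsLoss applied to the raw final-layer outputs is the numerically safer equivalent.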