Diagnosing the Environment Bias in Vision-and-Language Navigation

Authors: Yubo Zhang, Hao Tan, Mohit Bansal

IJCAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility assessment. Each entry below lists the reproducibility variable, the result, and the supporting LLM response.

Variable: Research Type
Result: Experimental
LLM Response: In this work, we design novel diagnosis experiments via environment re-splitting and feature replacement, looking into possible reasons for this environment bias. We observe that neither the language nor the underlying navigational graph, but the low-level visual appearance conveyed by ResNet features directly affects the agent model and contributes to this environment bias in results.

Variable: Researcher Affiliation
Result: Academia
LLM Response: Yubo Zhang, Hao Tan and Mohit Bansal, UNC Chapel Hill, {zhangyb, haotan, mbansal}@cs.unc.edu

Variable: Pseudocode
Result: No
LLM Response: The paper does not include any blocks labeled as "Pseudocode" or "Algorithm". It presents mathematical formulas for feature calculation and MLP training, but these are not structured as pseudocode.

Variable: Open Source Code
Result: Yes
LLM Response: Code, features at https://github.com/zhangybzbo/EnvBiasVLN.

Variable: Open Datasets
Result: Yes
LLM Response: Vision-and-Language Navigation: Several datasets have been released recently, such as Room-to-Room [Anderson et al., 2018b], Room-for-Room [Jain et al., 2019], Touchdown [Chen et al., 2019b], CVDN [Thomason et al., 2019b] and EQA [Das et al., 2018]. ... indoor VLN datasets (e.g., those collected from Matterport3D [Chang et al., 2017]) use disjoint sets of environments in training and testing.

Variable: Dataset Splits
Result: Yes
LLM Response: Two validation splits are provided as well: validation seen (which takes the data from training environments) and validation unseen (whose data is taken from testing environments different from the training). ... we re-split the environment and categorize the validation data into three sets based on their visibility to the training set: path-seen, path-unseen, and env-unseen. ... As shown in Table 2, the agent performs better on val path-seen than val path-unseen, which suggests that a path-level locality exists in current VLN agent models. (A sketch of this split categorization appears below, after the last entry.)

Variable: Hardware Specification
Result: No
LLM Response: The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. It only discusses the general setup without mentioning any concrete hardware.

Variable: Software Dependencies
Result: No
LLM Response: The paper mentions several software components and models like "ResNet features", "Faster R-CNN", and "multi-layer perceptron (MLP)", but it does not specify version numbers for any programming languages, libraries, or frameworks (e.g., Python version, PyTorch/TensorFlow version, specific ResNet model version beyond 152).

Variable: Experiment Setup
Result: Yes
LLM Response: We use a multi-layer perceptron (MLP) to generate the Learned-Seg semantic features. The multi-layer perceptron includes three fully-connected layers with ReLU activation on the outputs of the first two layers. The input of this MLP is the 2048-dim ResNet feature f of each image view. The hidden sizes of the first two layers are 512 and 256. The final layer outputs the 42-dim semantic feature y that represents the areas of each semantic class. After the linear layers, we use the sigmoid function σ to convert the output to the ratio of areas. ... The model is trained with ground-truth semantic areas (normalized to [0, 1]) of the views in 51 environments out of the total 61 VLN training environments, and is tuned on the remaining 10 environments. We minimize the binary cross-entropy loss between the ground-truth areas {ŷ_i} and the predicted areas {y_i}, where i indexes the semantic classes. Dropout layers with a probability of 0.3 are added between fully-connected layers during training. (A PyTorch sketch of this MLP follows below.)

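The environment re-splitting quoted in the Dataset Splits entry lends itself to a short illustration. The following is a minimal sketch, not taken from the paper's released code, of how a validation item could be bucketed into path-seen, path-unseen, or env-unseen given the set of training environments and training paths; the field names (env_id, path_id) and the helper function are assumptions introduced for illustration.

```python
# Hypothetical sketch of the three-way re-split categorization described above.
# Field names and the helper are assumptions, not the paper's actual code.

def categorize_val_item(item, train_env_ids, train_path_ids):
    """Return 'path-seen', 'path-unseen', or 'env-unseen' for one validation item."""
    if item["env_id"] not in train_env_ids:
        return "env-unseen"      # environment never appears in training
    if item["path_id"] in train_path_ids:
        return "path-seen"       # training environment and a path seen in training
    return "path-unseen"         # training environment, but a held-out path

# Toy usage:
train_envs = {"envA", "envB"}
train_paths = {"p1", "p2"}
val_items = [
    {"env_id": "envA", "path_id": "p1"},   # -> path-seen
    {"env_id": "envB", "path_id": "p9"},   # -> path-unseen
    {"env_id": "envC", "path_id": "p7"},   # -> env-unseen
]
for it in val_items:
    print(it, categorize_val_item(it, train_envs, train_paths))
```
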
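The Experiment Setup entry describes the Learned-Seg MLP precisely enough to sketch it. Below is a minimal PyTorch sketch that follows the quoted architecture (2048-dim ResNet feature input, hidden sizes 512 and 256, 42 output classes, sigmoid outputs, dropout 0.3 between fully-connected layers, binary cross-entropy loss); the optimizer, learning rate, and batch size are assumptions, since the quoted setup does not state them.

```python
# Minimal sketch of the Learned-Seg MLP under the assumptions stated above.
import torch
import torch.nn as nn

class LearnedSegMLP(nn.Module):
    def __init__(self, in_dim=2048, hidden1=512, hidden2=256, num_classes=42, p_drop=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden1),
            nn.ReLU(),
            nn.Dropout(p_drop),     # dropout between fully-connected layers (active in train mode)
            nn.Linear(hidden1, hidden2),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden2, num_classes),
            nn.Sigmoid(),           # convert outputs to area ratios in [0, 1]
        )

    def forward(self, resnet_feat):
        return self.net(resnet_feat)

# Toy training step: ResNet features f and normalized ground-truth areas y_gt.
model = LearnedSegMLP()
criterion = nn.BCELoss()                                      # binary cross-entropy on area ratios
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)     # assumed optimizer and learning rate

f = torch.randn(8, 2048)     # batch of 8 image-view features (assumed batch size)
y_gt = torch.rand(8, 42)     # ground-truth semantic areas in [0, 1]
loss = criterion(model(f), y_gt)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Using an explicit nn.Sigmoid with nn.BCELoss mirrors the quoted description; in practice, nn.BCEWithLogitsLoss applied to the raw final-layer outputs is the numerically safer equivalent.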