Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Generative Language-Grounded Policy in Vision-and-Language Navigation with Bayes' Rule

Authors: Shuhei Kurita, Kyunghyun Cho

ICLR 2021 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments, we show that the proposed generative approach outperforms the discriminative approach in the Room-2-Room (R2R) and Room-4-Room (R4R) datasets, especially in the unseen environments. We further show that the combination of the generative and discriminative policies achieves close to the state-of-the-art results in the R2R dataset, demonstrating that the generative and discriminative policies capture the different aspects of VLN.
Researcher Affiliation | Academia | Shuhei Kurita (AIP, RIKEN; PRESTO, JST); Kyunghyun Cho (Courant Institute, New York University; Center for Data Science, New York University; CIFAR Fellow)
Pseudocode | No | No pseudocode or algorithm blocks are explicitly present in the paper.
Open Source Code | Yes | The source code is available at https://github.com/shuheikurita/glgp.
Open Datasets | Yes | We conduct our experiments on the R2R navigation task (Anderson et al., 2018b), which is widely used for evaluating language-grounded navigation models, and R4R (Jain et al., 2019), which consists of longer and more complex paths when compared to R2R.
Dataset Splits | Yes | R2R contains four splits of data: train, validation-seen, validation-unseen and test-unseen. From the 90 scenes of Matterport 3D modelings (Chang et al., 2017), 61 scenes are pooled together and used as seen environments in both the training and validation-seen sets. Among the remaining scenes, 11 scenes form the validation-unseen set and 18 scenes the test-unseen set. ... The training set has 14,025 instructions, while the validation-seen and validation-unseen datasets have 1,020 and 2,349 instructions respectively. (These split statistics are summarized in a sketch after the table.)
Hardware Specification | Yes | We use a single NVIDIA V100 GPU for training.
Software Dependencies | No | The paper mentions using a neural network architecture and following existing codebases but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | We use the minibatch size of 25. We use the validation-unseen dataset to select hyperparameters. We use the mixture of supervised learning and imitation learning (Tan et al., 2019; Li et al., 2019) for both the generative and discriminative policies, which are referred to as teacher-forcing and student-forcing (Anderson et al., 2018b). In particular, during training, between the reference action a^T and a sampled action a^S, we select the next action by a = δ·a^S + (1 − δ)·a^T, where δ ∼ Bernoulli(η), following Li et al. (2019). We examine η ∈ {0, 1/5, 1/3, 1/2, 1} using the validation set and choose η = 1/3. ... β ∈ [0, 1] is a hyperparameter... In our experiment, we report the score of β = 0.5. (Sketches of this action-selection rule and the β mixture follow the table.)
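
For quick reference, the split statistics quoted in the Dataset Splits row can be collected into a small Python structure. This is only a summary of the numbers quoted above, not a data-loading utility from the released code; the test-unseen instruction count is not quoted above and is left unset.

```python
# Summary of the R2R splits as quoted from the paper (Anderson et al., 2018b).
# Scene counts partition the 90 Matterport3D scenes (Chang et al., 2017).
R2R_SPLITS = {
    "train":             {"scenes": 61, "instructions": 14025},
    "validation-seen":   {"scenes": 61, "instructions": 1020},  # same 61 seen scenes as train
    "validation-unseen": {"scenes": 11, "instructions": 2349},
    "test-unseen":       {"scenes": 18, "instructions": None},  # count not quoted above
}
```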
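The action-selection rule in the Experiment Setup row mixes teacher-forcing and student-forcing with a Bernoulli switch. Below is a minimal PyTorch-style sketch of that rule; the function and argument names are illustrative and not taken from the released glgp code.

```python
import torch

def select_next_action(action_logits, reference_action, eta=1/3):
    """Mix teacher-forcing and student-forcing at each training step:
    a = delta * a^S + (1 - delta) * a^T, with delta ~ Bernoulli(eta).

    action_logits: unnormalized policy scores over candidate actions (1-D tensor).
    reference_action: index of the ground-truth (teacher) action a^T.
    eta: probability of following the sampled (student) action a^S (paper: 1/3).
    """
    # Sample the student action a^S from the current policy.
    sampled_action = torch.distributions.Categorical(logits=action_logits).sample().item()
    # delta = 1 -> follow the sampled action; delta = 0 -> follow the reference.
    delta = torch.bernoulli(torch.tensor(eta)).item()
    return sampled_action if delta == 1.0 else reference_action
```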
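The β hyperparameter weights the generative policy, which scores actions by how well they explain the given instruction via Bayes' rule, against the discriminative policy. The exact combination rule is defined in the paper; the sketch below assumes a log-linear interpolation with a uniform action prior, and every name in it is illustrative.

```python
import torch
import torch.nn.functional as F

def combined_action_log_probs(disc_logits, gen_instruction_logprobs, beta=0.5):
    """Hedged sketch: interpolate the discriminative and generative policies.

    disc_logits: discriminative scores for p(a | instruction, state).
    gen_instruction_logprobs: log p(instruction | a, state) per candidate action
        from the generative (speaker-style) model. With a uniform prior over
        actions, Bayes' rule makes p(a | instruction, state) proportional to
        p(instruction | a, state), so renormalizing over actions gives the
        generative policy's posterior.
    beta: interpolation weight (the paper reports scores for beta = 0.5).
    """
    disc_logp = F.log_softmax(disc_logits, dim=-1)
    gen_logp = F.log_softmax(gen_instruction_logprobs, dim=-1)  # Bayes posterior, uniform prior
    return beta * gen_logp + (1 - beta) * disc_logp

# Example: pick the next action from the combined scores.
# next_action = combined_action_log_probs(d, g, beta=0.5).argmax().item()
```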