Generative Language-Grounded Policy in Vision-and-Language Navigation with Bayes' Rule
Authors: Shuhei Kurita, Kyunghyun Cho
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments, we show that the proposed generative approach outperforms the discriminative approach in the Room-2-Room (R2R) and Room-4-Room (R4R) datasets, especially in the unseen environments. We further show that the combination of the generative and discriminative policies achieves close to the state-of-the-art results in the R2R dataset, demonstrating that the generative and discriminative policies capture the different aspects of VLN. (See the score-combination sketch after the table.) |
| Researcher Affiliation | Academia | Shuhei Kurita (AIP, RIKEN; PRESTO, JST; shuhei.kurita@riken.jp); Kyunghyun Cho (Courant Institute, New York University; Center for Data Science, New York University; CIFAR Fellow; kyunghyun.cho@nyu.edu) |
| Pseudocode | No | No pseudocode or algorithm blocks are explicitly present in the paper. |
| Open Source Code | Yes | The source code is available at https://github.com/shuheikurita/glgp. |
| Open Datasets | Yes | We conduct our experiments on the R2R navigation task (Anderson et al., 2018b), which is widely used for evaluating language-grounded navigation models and R4R (Jain et al., 2019), which consists of longer and more complex paths when compared to R2R. |
| Dataset Splits | Yes | R2R contains four splits of data: train, validation-seen, validation-unseen and test-unseen. From the 90 scenes of Matterport 3D modelings (Chang et al., 2017), 61 scenes are pooled together and used as seen environments in both the training and validation-seen sets. Among the remaining scenes, 11 scenes form the validation-unseen set and 18 scenes the test-unseen set. ... The training set has 14,025 instructions, while the validation-seen and validation-unseen datasets have 1,020 and 2,349 instructions respectively. |
| Hardware Specification | Yes | We use a single NVIDIA V100 GPU for training. |
| Software Dependencies | No | The paper mentions using a neural network architecture and following existing codebases but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We use the minibatch-size of 25. We use the validation-unseen dataset to select hyperparameters. We use the mixture of supervised learning and imitation learning (Tan et al., 2019; Li et al., 2019) for both the generative and discriminative policies, which are referred to as teacher-forcing and student-forcing (Anderson et al., 2018b). In particular, during training, between the reference action a^T and a sampled action a^S, we select the next action by a = δ a^S + (1 − δ) a^T where δ ∼ Bernoulli(η), following Li et al. (2019). We examine η ∈ {0, 1/5, 1/3, 1/2, 1} using the validation set and choose η = 1/3. ... β ∈ [0, 1] is a hyperparameter... In our experiment, we report the score of β = 0.5. (Hedged sketches of the action mixing and the β combination follow the table.) |
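
The per-step mixing of teacher-forcing and student-forcing quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `choose_next_action` and its signature are hypothetical, and only the δ ∼ Bernoulli(η) switching rule comes from the paper.

```python
import random

def choose_next_action(teacher_action, sampled_action, eta=1/3):
    """Hypothetical helper: per-step mix of teacher- and student-forcing.

    Implements a = delta * a^S + (1 - delta) * a^T with delta ~ Bernoulli(eta):
    with probability eta the agent follows its own sampled action a^S
    (student-forcing), otherwise the reference action a^T (teacher-forcing).
    The paper selects eta = 1/3 on the validation-unseen split.
    """
    delta = random.random() < eta  # delta ~ Bernoulli(eta)
    return sampled_action if delta else teacher_action
```

Note that because δ ∈ {0, 1}, the "sum" a = δ a^S + (1 − δ) a^T is just a per-step selector between the two actions, not an interpolation of action vectors.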
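
The β-weighted combination of the generative and discriminative policies mentioned in the Research Type and Experiment Setup rows can likewise be sketched. The sketch below assumes a log-linear interpolation of the two policies' action distributions, with the generative side scored via Bayes' rule under a uniform prior over candidate actions, p(a | x, s) ∝ p(x | a, s); the exact combination rule and any action prior should be checked against the paper and the released code, and all names below are hypothetical.

```python
import numpy as np

def log_softmax(scores):
    """Normalize raw log-scores into log-probabilities (stable logsumexp)."""
    m = np.max(scores)
    return scores - (m + np.log(np.sum(np.exp(scores - m))))

def combined_action_scores(log_p_dis, log_p_x_given_a, beta=0.5):
    """Hypothetical sketch of the beta-weighted policy combination.

    log_p_dis:       log p(a | x, s) per candidate action (discriminative policy).
    log_p_x_given_a: log p(x | a, s) per candidate action (generative policy:
                     likelihood of the instruction x given each action).
    beta:            interpolation weight in [0, 1]; the paper reports beta = 0.5.
    """
    # Bayes' rule with an assumed uniform prior over candidate actions:
    # log p(a | x, s) = log p(x | a, s) + const, so normalizing the
    # instruction likelihoods yields the generative policy over actions.
    log_p_gen = log_softmax(np.asarray(log_p_x_given_a))
    # Assumed log-linear interpolation of the two policies.
    return beta * np.asarray(log_p_dis) + (1.0 - beta) * log_p_gen

# Usage: pick the next action as the argmax over candidate actions.
# a_next = int(np.argmax(combined_action_scores(ld_scores, lg_scores)))
```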