Contrastive Instruction-Trajectory Learning for Vision-Language Navigation
Authors: Xiwen Liang, Fengda Zhu, Yi Zhu, Bingqian Lin, Bing Wang, Xiaodan Liang
AAAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that the model with CITL surpasses the previous state-of-the-art methods on R2R, R4R, and RxR. |
| Researcher Affiliation | Collaboration | 1 Shenzhen Campus of Sun Yat-sen University, Shenzhen; 2 Monash University; 3 Huawei Noah's Ark Lab; 4 Alibaba Group |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/liangcici/CITL-VLN. |
| Open Datasets | Yes | The R2R (Anderson et al. 2018b) dataset consists of 90 housing environments. The training set comprises 61 scenes, and the validation unseen set and test unseen set contain 11 and 18 scenes respectively. R4R (Jain et al. 2019) concatenates the trajectories and instructions in R2R. RxR (Ku et al. 2020) is a larger dataset containing more extended instructions and trajectories. |
| Dataset Splits | Yes | The training set comprises 61 scenes, and the validation unseen set and test unseen set contain 11 and 18 scenes respectively. |
| Hardware Specification | Yes | All experiments are conducted on an NVIDIA 3090 GPU. |
| Software Dependencies | No | The paper mentions the 'MindSpore Lite tool' and 'MindSpore' but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | In all contrastive losses, the margin m is set to 0.25, and λ1, λ2 and λ3 are fixed to 0.1, 0.01 and 0.01 respectively. The size of all memory banks is fixed to 240. αp and αn are set to 1.2 and 1.4 respectively. Training schedules are the same as baselines (Tan, Yu, and Bansal 2019; Hong et al. 2021). |
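
For readers re-implementing the reported setup, the quoted hyperparameters can be collected into a single configuration object. The sketch below is illustrative only: the field names (`margin`, `lambda1`, `memory_bank_size`, etc.) and the assumption that the three λ values weight auxiliary contrastive losses added to a base navigation loss are ours, not taken verbatim from the paper or the released CITL-VLN code.

```python
from dataclasses import dataclass


@dataclass
class CITLHyperParams:
    """Hyperparameters as quoted in the experiment-setup row above.

    Field names are hypothetical and may not match the released code.
    """
    margin: float = 0.25         # margin m used in all contrastive losses
    lambda1: float = 0.1         # weight of the first contrastive loss term
    lambda2: float = 0.01        # weight of the second contrastive loss term
    lambda3: float = 0.01        # weight of the third contrastive loss term
    memory_bank_size: int = 240  # size of every memory bank
    alpha_p: float = 1.2         # positive-pair scaling factor (alpha_p)
    alpha_n: float = 1.4         # negative-pair scaling factor (alpha_n)


def total_loss(nav_loss: float, c1: float, c2: float, c3: float,
               hp: CITLHyperParams = CITLHyperParams()) -> float:
    """Assumed combination: navigation loss plus weighted contrastive terms."""
    return nav_loss + hp.lambda1 * c1 + hp.lambda2 * c2 + hp.lambda3 * c3
```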