Contrastive Instruction-Trajectory Learning for Vision-Language Navigation

Authors: Xiwen Liang, Fengda Zhu, Yi Zhu, Bingqian Lin, Bing Wang, Xiaodan Liang

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that the model with CITL surpasses the previous state-of-the-art methods on R2R, R4R, and RxR.
Researcher Affiliation | Collaboration | 1 Shenzhen Campus of Sun Yat-sen University, Shenzhen; 2 Monash University; 3 Huawei Noah's Ark Lab; 4 Alibaba Group
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/liangcici/CITL-VLN.
Open Datasets | Yes | The R2R (Anderson et al. 2018b) dataset consists of 90 housing environments. The training set comprises 61 scenes, and the validation unseen set and test unseen set contain 11 and 18 scenes respectively. R4R (Jain et al. 2019) concatenates the trajectories and instructions in R2R. RxR (Ku et al. 2020) is a larger dataset containing more extended instructions and trajectories.
Dataset Splits | Yes | The training set comprises 61 scenes, and the validation unseen set and test unseen set contain 11 and 18 scenes respectively.
Hardware Specification | Yes | All experiments are conducted on an NVIDIA 3090 GPU.
Software Dependencies | No | The paper mentions the MindSpore Lite tool and MindSpore but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | In all contrastive losses, the margin m is set to 0.25, and λ1, λ2 and λ3 are fixed to 0.1, 0.01 and 0.01 respectively. The size of all memory banks is fixed to 240. αp and αn are set to 1.2 and 1.4 respectively. Training schedules are the same as baselines (Tan, Yu, and Bansal 2019; Hong et al. 2021).
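To make the Experiment Setup row concrete, below is a minimal sketch of how the quoted hyperparameters (margin m, loss weights λ1–λ3, memory-bank size, and the αp/αn scale factors) might be wired into a margin-based contrastive term. The loss form, function names, and tensor shapes are assumptions for illustration, not the paper's exact formulation; the released code at https://github.com/liangcici/CITL-VLN is the authoritative reference.

```python
import torch
import torch.nn.functional as F

# Hyperparameters quoted in the Experiment Setup row above; everything else
# (loss form, function names, tensor shapes) is an illustrative assumption.
MARGIN_M = 0.25
LAMBDA_1, LAMBDA_2, LAMBDA_3 = 0.1, 0.01, 0.01
MEMORY_BANK_SIZE = 240          # maximum number of negatives kept per memory bank
ALPHA_P, ALPHA_N = 1.2, 1.4     # scale factors for positive / negative similarities

def margin_contrastive_loss(anchor, positive, negatives, m=MARGIN_M):
    """Generic margin-based contrastive term over cosine similarities.

    anchor:    (D,) embedding of the query (e.g. an instruction)
    positive:  (D,) embedding of its paired sample (e.g. a trajectory)
    negatives: (K, D) embeddings drawn from a memory bank, K <= MEMORY_BANK_SIZE
    """
    sim_p = F.cosine_similarity(anchor, positive, dim=0)                        # scalar
    sim_n = F.cosine_similarity(anchor.expand_as(negatives), negatives, dim=1)  # (K,)
    # Scale the positive/negative similarities with alpha_p / alpha_n and hinge
    # the gap at margin m (one plausible reading of how the constants interact).
    return F.relu(ALPHA_N * sim_n - ALPHA_P * sim_p + m).mean()

if __name__ == "__main__":
    q = torch.randn(128)
    p = torch.randn(128)
    bank = torch.randn(MEMORY_BANK_SIZE, 128)
    print(margin_contrastive_loss(q, p, bank).item())

# The individual contrastive terms would then be combined with the quoted
# weights, e.g.:
# total_loss = nav_loss + LAMBDA_1 * loss_a + LAMBDA_2 * loss_b + LAMBDA_3 * loss_c
```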
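Likewise, the R2R scene counts quoted in the Open Datasets and Dataset Splits rows can be summarized as a small config sketch; the key names are illustrative, not the dataset's official split identifiers.

```python
# Scene counts for R2R as quoted above (Anderson et al. 2018b).
R2R_SCENE_COUNTS = {
    "train": 61,        # training environments
    "val_unseen": 11,   # validation unseen environments
    "test_unseen": 18,  # test unseen environments
}

# 61 + 11 + 18 covers the 90 housing environments mentioned in the paper
# (the validation seen split reuses training environments).
assert sum(R2R_SCENE_COUNTS.values()) == 90
```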