Landmark-RxR: Solving Vision-and-Language Navigation with Fine-Grained Alignment Supervision

Authors: Keji He, Yan Huang, Qi Wu, Jianhua Yang, Dong An, Shuanglin Sima, Liang Wang

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that our agent has superior navigation performance on Landmark-RxR, en-RxR and R2R.
Researcher Affiliation | Academia | 1 Center for Research on Intelligent Perception and Computing, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences; 2 School of Artificial Intelligence, University of Chinese Academy of Sciences; 3 School of Computer Science, University of Adelaide; 4 School of Future Technology, University of Chinese Academy of Sciences; 5 School of Artificial Intelligence, Beijing University of Posts and Telecommunications; 6 Center for Excellence in Brain Science and Intelligence Technology (CEBSIT); 7 Chinese Academy of Sciences, Artificial Intelligence Research (CAS-AIR). Emails: {keji.he, dong.an, shuanglin.sima}@cripac.ia.ac.cn; {yhuang, wangliang}@nlpr.ia.ac.cn; qi.wu01@adelaide.edu.au; youngjianhua@bupt.edu.cn
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our dataset and code are available at https://github.com/hekj/Landmark-RxR.
Open Datasets | Yes | Our dataset and code are available at https://github.com/hekj/Landmark-RxR. Our Landmark-RxR is built based on the English Guide part of RxR (en-RxR). The training data involves four parts: the sub-instruction and sub-trajectory pairs (sub pairs) from Landmark-RxR; the synthesized instruction and synthesized trajectory pairs (synthesized pairs), which are augmented data obtained by concatenating several continuous sub pairs as in [16]; the complete instruction and complete trajectory pairs (complete pairs) from en-RxR; and the instruction and trajectory pairs from R2R.
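The concatenation-based augmentation described above can be sketched as follows. This is a minimal illustration, not the authors' actual code: the function name, the data layout (lists of `(instruction, trajectory)` tuples), and the `span` parameter are all assumptions; duplicate junction viewpoints are kept for simplicity.

```python
def synthesize_pairs(sub_pairs, span=2):
    """Concatenate `span` consecutive (instruction, trajectory) sub-pairs
    into longer synthesized training pairs (hypothetical sketch)."""
    synthesized = []
    for i in range(len(sub_pairs) - span + 1):
        window = sub_pairs[i:i + span]
        # Join the sub-instructions into one longer instruction.
        instruction = " ".join(instr for instr, _ in window)
        # Concatenate the sub-trajectories (viewpoint ID sequences).
        trajectory = [node for _, traj in window for node in traj]
        synthesized.append((instruction, trajectory))
    return synthesized

sub_pairs = [
    ("Walk past the couch.", ["v1", "v2"]),
    ("Turn left at the door.", ["v2", "v3"]),
    ("Stop by the table.", ["v3", "v4"]),
]
print(synthesize_pairs(sub_pairs, span=2))
```

With three sub pairs and a window of two, this yields two synthesized pairs, each covering two consecutive navigation segments.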
Dataset Splits | Yes | Table 1 gives statistics on R2R, RxR, en-RxR and Landmark-RxR. The total number of sub-instructions from Landmark-RxR is 166,740, which contains 133,602 sub-instructions in the train split, 13,591 in the validation seen split, and 19,547 in the validation unseen split.
Hardware Specification | Yes | Model training consumes about 1,600 minutes at the stage of imitation learning and 3,400 minutes at the stage of reinforcement learning on a single RTX 3090 GPU.
Software Dependencies | No | The paper mentions using visual features from ResNet and GloVe300 for word embedding but does not provide specific version numbers for any software dependencies such as programming languages or libraries.
Experiment Setup | Yes | The importance factor λ in the soft focal-oriented reward is set to 10. For the focal-oriented reward, the same number of critical points is sampled in each trajectory to regularize the range of the reward value per episode. Rsoft focal and Rhard focal both sample 2 critical points from the landmark set in each trajectory, which empirically gives the best trade-off between the SR and Loss Number metrics. The maximum navigation step π allowed for each sub-trajectory is set to 10. The batch size is set to 100 and the learning rate to 1e-4. The total iterations are 100,000 for imitation learning and 20,000 for reinforcement learning.
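The hyperparameters quoted above can be collected into a single configuration sketch. The key names below are illustrative assumptions; only the values come from the paper's reported setup.

```python
# Hypothetical training config assembled from the reported hyperparameters.
config = {
    "lambda_soft_focal": 10,       # importance factor λ in the soft focal-oriented reward
    "critical_points": 2,          # landmarks sampled per trajectory for the focal rewards
    "max_steps_per_sub_traj": 10,  # maximum navigation steps allowed per sub-trajectory
    "batch_size": 100,
    "learning_rate": 1e-4,
    "il_iterations": 100_000,      # imitation learning stage
    "rl_iterations": 20_000,       # reinforcement learning stage
}

for key, value in config.items():
    print(f"{key}: {value}")
```

Keeping the reported values in one place like this makes a reproduction attempt easier to audit against the paper.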