Vision-Language Navigation with Energy-Based Policy

Authors: Rui Liu, Wenguan Wang, Yi Yang

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that ENP outperforms its counterparts, e.g., 2% SR and 1% SPL gains on R2R [1], 1.22% RGS on REVERIE [30], 2% SR on R2R-CE [31], and 1.07% NDTW on RxR-CE [32], respectively. We examine the efficacy of ENP for VLN in discrete environments (§4.1) and in more challenging continuous environments (§4.2). Then we provide diagnostic analysis of core model design (§4.3).
Researcher Affiliation | Academia | Rui Liu1,2, Wenguan Wang2, Yi Yang1,2; 1The State Key Lab of Brain-Machine Intelligence, Zhejiang University, Hangzhou, China; 2College of Computer Science and Technology, Zhejiang University, Hangzhou, China
Pseudocode | Yes | Algorithm 1: Energy-based Navigation Policy (ENP) Learning Algorithm
Open Source Code | No | As we promised, the code will be released upon the publication of our paper.
Open Datasets | Yes | R2R [1] contains 7,189 shortest-path trajectories captured from 90 real-world building-scale scenes [87]. It consists of 22K step-by-step navigation instructions. REVERIE [30] contains 21,702 high-level instructions... Both R2R [1] and REVERIE [30] are built upon the Matterport3D Simulator [1]. R2R-CE [31] and RxR-CE [32] are more practical yet challenging...
Dataset Splits | Yes | All these datasets are divided into train, val seen, val unseen, and test unseen splits, which mainly assess generalization to unseen environments.
Hardware Specification | Yes | All experiments are conducted on a single NVIDIA 4090 GPU with 24 GB memory in PyTorch. Testing is conducted on the same machine.
Software Dependencies | No | The paper mentions 'PyTorch' but does not specify a version number for it or for any other software dependencies, such as specific libraries or optimizers.
Experiment Setup | Yes | For fairness, the hyper-parameters (e.g., batch size, optimizer, maximum iterations, learning rates) of these models are kept at their original settings. For SGLD, ŝ0 is sampled from M with a probability of 95% (Eq. 9). In EnvDrop [3] and VLN-BERT [4], the number of SGLD iterations is set as I = 15. Ipre = 20 and Ift = 5 are used for the pre-training and fine-tuning stages of DUET [6] and ETPNav [29]. ... The step size ε = 1.5 and ξ ∼ N(0, 0.01) are set in ENP (§4.3).
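The SGLD loop described above (I iterations, step size ε = 1.5, Gaussian noise ξ ∼ N(0, 0.01)) can be sketched as follows. This is a generic Langevin-dynamics illustration under stated assumptions, not the authors' implementation: the `energy_grad` callable, the toy quadratic energy, and the `sgld_sample` helper name are all hypothetical.

```python
import numpy as np

def sgld_sample(energy_grad, s0, num_iters=15, step_size=1.5,
                noise_std=0.01, rng=None):
    """Stochastic Gradient Langevin Dynamics sampling sketch.

    Update rule: s_{t+1} = s_t - (eps / 2) * dE/ds(s_t) + xi,
    with xi ~ N(0, noise_std^2), matching the reported settings
    (I = 15 iterations, eps = 1.5, xi ~ N(0, 0.01)).
    """
    rng = rng or np.random.default_rng(0)
    s = np.array(s0, dtype=float)
    for _ in range(num_iters):
        grad = energy_grad(s)                          # dE/ds at current sample
        noise = rng.normal(0.0, noise_std, size=s.shape)
        s = s - 0.5 * step_size * grad + noise         # Langevin update step
    return s

# Toy quadratic energy E(s) = ||s||^2 / 2, so dE/ds = s.
sample = sgld_sample(lambda s: s, s0=np.ones(4))
```

In the paper's setup the initial state ŝ0 would additionally be drawn from a replay buffer M with 95% probability (Eq. 9) rather than always from a fixed point; that buffer logic is omitted here for brevity.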