Why Only Text: Empowering Vision-and-Language Navigation with Multi-modal Prompts

Authors: Haodong Hong, Sen Wang, Zi Huang, Qi Wu, Jiajun Liu

IJCAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on four VLN benchmarks (R2R, RxR, REVERIE, CVDN) show that incorporating visual prompts significantly boosts navigation performance.
Researcher Affiliation | Collaboration | The University of Queensland; CSIRO Data61; The University of Adelaide
Pseudocode | No | The paper describes procedures and pipelines but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Code is available at https://github.com/honghd16/VLN-MP.
Open Datasets | Yes | We conduct extensive experiments on four datasets, R2R, RxR, REVERIE, and CVDN, to demonstrate that agents trained in VLN-MP improve navigation performance across different image prompt settings while maintaining robustness in traditional VLN tasks.
Dataset Splits | Yes | The original datasets have four splits: train, validation seen (val seen), validation unseen (val unseen), and test unseen.
Hardware Specification | Yes | All models are fine-tuned for 200K iterations with a learning rate of 1e-5 and a batch size of 8 on a single NVIDIA A6000 GPU.
Software Dependencies | No | In the pipeline, we utilize GPT-4 from OpenAI's official API, and the GLIP-L and Grounding DINO-T models for landmark detection. For non-English languages, we use the Google Translate service to translate them into English. We generate five novel images per visual prompt using the control_sd15_mlsd model for data augmentation.
Experiment Setup | Yes | All models are fine-tuned for 200K iterations with a learning rate of 1e-5 and a batch size of 8 on a single NVIDIA A6000 GPU. The weights for score balancing in Eq. (2) are β0 = 0.5 and β1 = 0.1 to prioritize the sequence score (Ss) over the others, reflecting their relative importance in our method. During training, we select the augmented data with a probability of γ = 0.2 to replace original landmark images.
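To make the quoted hyperparameters in the Experiment Setup row easier to scan, here is a minimal Python sketch of the fine-tuning configuration and of the γ = 0.2 augmented-image replacement step. Only the numeric values (iterations, learning rate, batch size, β0, β1, γ, five augmented images per prompt) come from the paper; the names FineTuneConfig and pick_landmark_image, the file names, and the overall structure are hypothetical illustrations, not the authors' code.

```python
import random
from dataclasses import dataclass


@dataclass
class FineTuneConfig:
    """Hyperparameters quoted from the paper's experiment setup."""
    iterations: int = 200_000        # fine-tuning steps
    learning_rate: float = 1e-5
    batch_size: int = 8
    beta_0: float = 0.5              # weight on the sequence score S_s in Eq. (2)
    beta_1: float = 0.1              # weight on the remaining score terms in Eq. (2)
    aug_replace_prob: float = 0.2    # gamma: chance of swapping in an augmented landmark image
    augmented_per_prompt: int = 5    # novel images generated per visual prompt


def pick_landmark_image(original_image: str,
                        augmented_images: list[str],
                        cfg: FineTuneConfig) -> str:
    """With probability gamma, replace the original landmark image with one of the
    generated variants; otherwise keep the original (illustrative sketch only)."""
    if augmented_images and random.random() < cfg.aug_replace_prob:
        return random.choice(augmented_images)
    return original_image


if __name__ == "__main__":
    cfg = FineTuneConfig()
    # Hypothetical file names purely for demonstration.
    variants = [f"landmark_0001_aug{i}.png" for i in range(cfg.augmented_per_prompt)]
    print(pick_landmark_image("landmark_0001.png", variants, cfg))
```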