Why Only Text: Empowering Vision-and-Language Navigation with Multi-modal Prompts

Authors: Haodong Hong, Sen Wang, Zi Huang, Qi Wu, Jiajun Liu

IJCAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on four VLN benchmarks (R2R, RxR, REVERIE, CVDN) show that incorporating visual prompts significantly boosts navigation performance.
Researcher Affiliation | Collaboration | The University of Queensland; CSIRO Data61; The University of Adelaide
Pseudocode | No | The paper describes procedures and pipelines but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Code is available at https://github.com/honghd16/VLN-MP.
Open Datasets | Yes | We conduct extensive experiments on four datasets, R2R, RxR, REVERIE, and CVDN, to demonstrate that agents trained in VLN-MP improve navigation performance across different image prompt settings while maintaining robustness in traditional VLN tasks.
Dataset Splits | Yes | The original datasets have four splits: train, validation seen (val seen), validation unseen (val unseen), and test unseen.
Hardware Specification | Yes | All models are fine-tuned for 200K iterations with a learning rate of 1e-5 and a batch size of 8 on a single NVIDIA A6000 GPU.
Software Dependencies | No | In the pipeline, we utilize GPT-4 from OpenAI's official API, and the GLIP-L and Grounding DINO-T models for landmark detection. For non-English languages, we use the Google Translate service to translate them into English. We generate five novel images per visual prompt using the control_sd15_mlsd model for data augmentation.
Experiment Setup | Yes | All models are fine-tuned for 200K iterations with a learning rate of 1e-5 and a batch size of 8 on a single NVIDIA A6000 GPU. The weights for score balancing in Eq. (2) are β0 = 0.5 and β1 = 0.1 to prioritize the sequence score (Ss) over the others, reflecting their relative importance in our method. During training, we select the augmented data with a probability of γ = 0.2 to replace original landmark images.
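To make the quoted hyperparameters in the Experiment Setup row easier to scan, here is a minimal Python sketch of the fine-tuning configuration and of the γ = 0.2 augmented-image replacement step. Only the numeric values (iterations, learning rate, batch size, β0, β1, γ, five augmented images per prompt) come from the paper; the names FineTuneConfig and pick_landmark_image, the file names, and the overall structure are hypothetical illustrations, not the authors' code.

```python
import random
from dataclasses import dataclass


@dataclass
class FineTuneConfig:
    """Hyperparameters quoted from the paper's experiment setup."""
    iterations: int = 200_000        # fine-tuning steps
    learning_rate: float = 1e-5
    batch_size: int = 8
    beta_0: float = 0.5              # weight on the sequence score S_s in Eq. (2)
    beta_1: float = 0.1              # weight on the remaining score terms in Eq. (2)
    aug_replace_prob: float = 0.2    # gamma: chance of swapping in an augmented landmark image
    augmented_per_prompt: int = 5    # novel images generated per visual prompt


def pick_landmark_image(original_image: str,
                        augmented_images: list[str],
                        cfg: FineTuneConfig) -> str:
    """With probability gamma, replace the original landmark image with one of the
    generated variants; otherwise keep the original (illustrative sketch only)."""
    if augmented_images and random.random() < cfg.aug_replace_prob:
        return random.choice(augmented_images)
    return original_image


if __name__ == "__main__":
    cfg = FineTuneConfig()
    # Hypothetical file names purely for demonstration.
    variants = [f"landmark_0001_aug{i}.png" for i in range(cfg.augmented_per_prompt)]
    print(pick_landmark_image("landmark_0001.png", variants, cfg))
```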