Why Only Text: Empowering Vision-and-Language Navigation with Multi-modal Prompts
Authors: Haodong Hong, Sen Wang, Zi Huang, Qi Wu, Jiajun Liu
IJCAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on four VLN benchmarks (R2R, RxR, REVERIE, CVDN) show that incorporating visual prompts significantly boosts navigation performance. |
| Researcher Affiliation | Collaboration | ¹The University of Queensland, ²CSIRO Data61, ³The University of Adelaide |
| Pseudocode | No | The paper describes procedures and pipelines but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Code is available at https://github.com/honghd16/VLN-MP. |
| Open Datasets | Yes | We conduct extensive experiments on four datasets (R2R, RxR, REVERIE, and CVDN) to demonstrate that agents trained in VLN-MP improve navigation performance across different image prompt settings while maintaining robustness in traditional VLN tasks. |
| Dataset Splits | Yes | The original datasets have four splits: train, validation seen (val seen), validation unseen (val unseen), and test unseen. |
| Hardware Specification | Yes | All models are fine-tuned for 200K iterations with a learning rate of 1e-5 and a batch size of 8 on a single NVIDIA A6000 GPU. |
| Software Dependencies | No | In the pipeline, we utilize GPT-4 from OpenAI's official API, and the GLIP-L and Grounding DINO-T models for landmark detection. For non-English languages, we use the Google Translate service to translate them into English. We generate five novel images per visual prompt using the control_sd15_mlsd model for data augmentation. (A hedged sketch of such a detection step follows the table.) |
| Experiment Setup | Yes | All models are fine-tuned for 200K iterations with a learning rate of 1e-5 and a batch size of 8 on a single NVIDIA A6000 GPU. The weights for score balancing in Eq. (2) are β₀ = 0.5 and β₁ = 0.1 to prioritize the sequence score (Sₛ) over the others, reflecting their relative importance in our method. During training, we select the augmented data with a probability of γ = 0.2 to replace original landmark images. (A sketch of this augmentation step also follows the table.) |
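The Software Dependencies row names GLIP-L and Grounding DINO-T for landmark detection. Below is a minimal, hedged sketch of one such detection step using the Hugging Face `transformers` port of Grounding DINO (`IDEA-Research/grounding-dino-tiny`); the model ID, thresholds, and phrase formatting are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch: zero-shot landmark grounding with Grounding DINO-T.
# Assumes the Hugging Face port "IDEA-Research/grounding-dino-tiny";
# thresholds and phrase formatting are illustrative, not the paper's.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

MODEL_ID = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForZeroShotObjectDetection.from_pretrained(MODEL_ID)

def detect_landmarks(image: Image.Image, landmark_phrases: list[str]):
    """Return (phrase, box, score) triples for detected landmarks."""
    # Grounding DINO expects one lowercase query with phrases separated
    # by periods, e.g. "a wooden staircase. a leather sofa."
    text = ". ".join(p.lower() for p in landmark_phrases) + "."
    inputs = processor(images=image, text=text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    results = processor.post_process_grounded_object_detection(
        outputs,
        inputs.input_ids,
        box_threshold=0.35,   # assumed value
        text_threshold=0.25,  # assumed value
        target_sizes=[image.size[::-1]],  # (height, width)
    )[0]
    return list(zip(results["labels"], results["boxes"], results["scores"]))
```

A caller could keep the highest-scoring box per phrase; that selection rule is an assumption on our part, not something quoted from the paper.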
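The γ = 0.2 replacement described in the Experiment Setup row amounts to a stochastic swap at training time: with probability γ, an instruction's original landmark images are replaced by one of the five ControlNet-generated (control_sd15_mlsd) variants. The sketch below uses hypothetical helper and variable names; the paper gives the probability but not the code.

```python
import random

GAMMA = 0.2  # replacement probability gamma reported in the paper

def choose_landmark_images(original_images, augmented_variants):
    # `augmented_variants` is assumed to hold the five
    # control_sd15_mlsd-generated image sets for this visual prompt.
    # With probability GAMMA, train on a random variant instead of
    # the originals; otherwise keep the originals unchanged.
    if random.random() < GAMMA:
        return random.choice(augmented_variants)
    return original_images
```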