NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models
Authors: Gengze Zhou, Yicong Hong, Qi Wu
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through comprehensive experiments, we demonstrate NavGPT can explicitly perform high-level planning for navigation, including decomposing instruction into sub-goals, integrating commonsense knowledge relevant to navigation task resolution, identifying landmarks from observed scenes, tracking navigation progress, and adapting to exceptions with plan adjustment. We evaluate NavGPT based on GPT-4 (OpenAI 2023) and GPT-3.5 on the R2R dataset (Anderson et al. 2018). |
| Researcher Affiliation | Academia | Gengze Zhou¹, Yicong Hong², Qi Wu¹* — ¹The University of Adelaide, ²The Australian National University — {gengze.zhou, qi.wu01}@adelaide.edu.au, mr.yiconghong@gmail.com |
| Pseudocode | No | The paper describes the system architecture and methodology in detail but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at: https://github.com/GengzeZhou/NavGPT. |
| Open Datasets | Yes | We evaluate NavGPT based on GPT-4 (OpenAI 2023) and GPT-3.5 on the R2R dataset (Anderson et al. 2018). The R2R dataset is composed of 7189 trajectories, each corresponding to three fine-grained instructions. |
| Dataset Splits | Yes | The dataset is separated into the train, val seen, val unseen, and test unseen splits, with 61, 56, 11, and 18 indoor scenes, respectively. We apply the 783 trajectories in the 11 val unseen environments in all our experiments and for comparison to previous supervised approaches. (See the split-counting sketch after the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions models and tools used (e.g., BLIP-2, Fast-RCNN, GPT-4, GPT-3.5) but does not provide specific software environment details with version numbers (e.g., Python version, PyTorch version, CUDA version) needed for replication. |
| Experiment Setup | Yes | In our work, we set the field of view of each view to 45°, and turn the heading angle θ by 45° per view from 0° to 360°, 8 directions in total. Besides, we turn the elevation angle ϕ by 30° per view, from 30° above the horizontal level to 30° below, 3 levels in total. As a result, we obtain 3 × 8 = 24 egocentric views for each viewpoint. We utilize BLIP-2 ViT-G FlanT5XL (Li et al. 2023a) as the image translator and Fast-RCNN (Girshick 2015) as the object detector. We investigate 3 granularities of visual representation from a viewpoint. Specifically, variant #1 utilizes an image with 60° FoV and turns the heading angle 30° clockwise to obtain 12 views from a viewpoint, while variants #2 and #3 utilize images with 30° and 45° FoV respectively, turn the elevation angle 30° from top to bottom, and turn the heading angle 30° and 45° clockwise to form 36 views and 24 views, respectively. (See the view-enumeration sketch below the table.) |
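
The val unseen split quoted in the Dataset Splits row can be inspected directly from the public R2R annotations. The following is a minimal sketch, not code from the NavGPT repository; it assumes the standard `R2R_val_unseen.json` file from the R2R release, in which each entry holds one trajectory with a `scan` (scene id) field and three instructions.

```python
import json
from collections import Counter

# Load the public R2R val unseen annotations (file name assumed from the
# standard R2R release; adjust the path to your local copy).
with open("R2R_val_unseen.json") as f:
    episodes = json.load(f)  # one entry per trajectory, each with 3 instructions

# Each entry's "scan" field names the Matterport3D scene it belongs to.
scene_counts = Counter(ep["scan"] for ep in episodes)

print(f"trajectories: {len(episodes)}")       # the paper uses these 783 trajectories
print(f"unseen scenes: {len(scene_counts)}")  # spread over 11 indoor scenes
```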
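
The viewpoint discretization in the Experiment Setup row can be written out explicitly. The sketch below is illustrative only (the names `FOV`, `HEADINGS`, and `ELEVATIONS` are ours, not from the paper); it enumerates the 3 × 8 = 24 egocentric views per viewpoint.

```python
# Enumerate the 24 egocentric views per viewpoint described above:
# 8 heading angles (45° apart) x 3 elevation levels, each image with 45° FoV.
FOV = 45                             # field of view per image, in degrees
HEADINGS = list(range(0, 360, 45))   # 0°, 45°, ..., 315° -> 8 directions
ELEVATIONS = [-30, 0, 30]            # 30° below horizontal, level, 30° above

views = [(h, e) for e in ELEVATIONS for h in HEADINGS]
assert len(views) == 24              # 3 x 8 = 24 views, matching the paper
```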