SG-Nav: Online 3D Scene Graph Prompting for LLM-based Zero-shot Object Navigation
Authors: Hang Yin, Xiuwei Xu, Zhenyu Wu, Jie Zhou, Jiwen Lu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on MP3D, HM3D and RoboTHOR environments, where SG-Nav surpasses previous state-of-the-art zero-shot methods by more than 10% SR on all benchmarks, while the decision process is explainable. |
| Researcher Affiliation | Academia | Hang Yin (1), Xiuwei Xu (1), Zhenyu Wu (2), Jie Zhou (1), Jiwen Lu (1); (1) Tsinghua University, (2) Beijing University of Posts and Telecommunications; {yinh23, xxw21}@mails.tsinghua.edu.cn, wuzhenyu@bupt.edu.cn, {jzhou, lujiwen}@tsinghua.edu.cn |
| Pseudocode | No | The paper does not contain any blocks explicitly labeled as "Pseudocode" or "Algorithm". |
| Open Source Code | Yes | We have uploaded our code on MP3D dataset in the supplementary materials. We will release all code upon acceptance. |
| Open Datasets | Yes | We evaluate our method on three datasets: Matterport3D (MP3D) [5], HM3D [31] and RoboTHOR [9]. |
| Dataset Splits | Yes | We test our SG-Nav on the MP3D validation set, which contains 11 indoor scenes, 21 object goal categories and 2195 object-goal navigation episodes. HM3D is used in the Habitat 2022 ObjectNav challenge and contains 2000 validation episodes on 20 validation environments with 6 goal object categories. RoboTHOR is used in the RoboTHOR 2020 and 2021 ObjectNav challenges and contains 1800 validation episodes on 15 validation environments with 12 goal object categories. (A structured summary of these splits follows the table.) |
| Hardware Specification | Yes | The evaluation of our model is performed on four RTX 3090 GPUs. |
| Software Dependencies | Yes | For the verification of short edges, we adopt LLaVA-1.6 (Mistral-7B) [24] as the VLM for discrimination. We choose LLaMA-7B [36] and GPT-4-0613 as the LLM for our SG-Nav, and denote the two variants SG-Nav-LLaMA and SG-Nav-GPT. (An illustrative loading sketch follows the table.) |
| Experiment Setup | Yes | We set 500 as the maximum number of navigation steps. The farthest and closest perceived distances of the agent are 10m and 1.5m. Each forward step of the agent covers 0.25m, and each rotation turns 30°. The agent's camera is mounted 0.90m above the ground with a horizontal view and outputs 640 × 480 RGB-D images. We maintain an 800 × 800 2D occupancy map with a resolution of 0.05m, which can represent a 40m × 40m large-scale scene. In Equation 4, the hyper-parameters are Nmax = 10 and Sthres = 0.8. (These settings are collected in the configuration sketch after the table.) |
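
For quick reference, the validation-split statistics quoted in the Dataset Splits row can be gathered into a small Python structure. The numbers below are taken directly from the paper; the dictionary layout itself is only an illustrative convention, not part of the released code.

```python
# Validation-split statistics for the three ObjectNav benchmarks, as reported in
# the SG-Nav paper. The dictionary layout is illustrative, not the authors' code.
VAL_SPLITS = {
    "MP3D":     {"scenes": 11, "goal_categories": 21, "episodes": 2195},
    "HM3D":     {"scenes": 20, "goal_categories": 6,  "episodes": 2000},
    "RoboTHOR": {"scenes": 15, "goal_categories": 12, "episodes": 1800},
}

if __name__ == "__main__":
    for name, stats in VAL_SPLITS.items():
        print(f"{name}: {stats['episodes']} episodes over "
              f"{stats['scenes']} scenes, {stats['goal_categories']} goal categories")
```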
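The Software Dependencies row names the models but not how they are instantiated. Below is a minimal sketch of how the listed models could be loaded with Hugging Face transformers and queried through the OpenAI client; the repository IDs ("llava-hf/llava-v1.6-mistral-7b-hf", "huggyllama/llama-7b") and the wiring are assumptions for illustration, not the authors' released loading code.

```python
# Illustrative loading of the models named in the paper. The repository IDs and the
# use of Hugging Face transformers / the OpenAI client are assumptions; the paper
# only states which models are used (LLaVA-1.6 Mistral-7B, LLaMA-7B, GPT-4-0613).
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    LlavaNextForConditionalGeneration,
    LlavaNextProcessor,
)
from openai import OpenAI

# VLM used to verify short edges in the scene graph (LLaVA-1.6 with a Mistral-7B backbone).
vlm_processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
vlm = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf", torch_dtype=torch.float16, device_map="auto"
)

# Open-weight LLM variant (SG-Nav-LLaMA); "huggyllama/llama-7b" is a hypothetical mirror ID.
llm_tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
llm = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b", torch_dtype=torch.float16, device_map="auto"
)

# Closed LLM variant (SG-Nav-GPT), queried through the OpenAI chat API.
client = OpenAI()
reply = client.chat.completions.create(
    model="gpt-4-0613",
    messages=[{"role": "user", "content": "Which frontier is most likely near a bed?"}],
)
print(reply.choices[0].message.content)
```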
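The navigation, camera and mapping settings in the Experiment Setup row can likewise be written as a single configuration object. The field names below are illustrative (they do not come from the released code), while the values match those reported in the paper.

```python
# Navigation, camera and mapping settings reported in the paper, gathered into one
# dataclass. Field names are illustrative; values follow the Experiment Setup row.
from dataclasses import dataclass


@dataclass(frozen=True)
class SGNavEvalConfig:
    max_steps: int = 500                      # maximal navigation steps per episode
    min_perceived_dist_m: float = 1.5         # closest perceived distance
    max_perceived_dist_m: float = 10.0        # farthest perceived distance
    forward_step_m: float = 0.25              # translation per forward action
    turn_angle_deg: float = 30.0              # rotation per turn action
    camera_height_m: float = 0.90             # camera height above the ground, horizontal view
    rgbd_resolution: tuple = (640, 480)       # RGB-D image size
    map_size_cells: int = 800                 # 800 x 800 2D occupancy map
    map_resolution_m: float = 0.05            # 0.05 m per cell -> 40 m x 40 m coverage
    n_max: int = 10                           # N_max in Equation 4
    s_thres: float = 0.8                      # S_thres in Equation 4


cfg = SGNavEvalConfig()
assert cfg.map_size_cells * cfg.map_resolution_m == 40.0  # 40 m x 40 m scene coverage
```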