SG-Nav: Online 3D Scene Graph Prompting for LLM-based Zero-shot Object Navigation

Authors: Hang Yin, Xiuwei Xu, Zhenyu Wu, Jie Zhou, Jiwen Lu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on MP3D, HM3D and RoboTHOR environments, where SG-Nav surpasses previous state-of-the-art zero-shot methods by more than 10% SR on all benchmarks, while the decision process is explainable.
Researcher Affiliation | Academia | Hang Yin¹, Xiuwei Xu¹, Zhenyu Wu², Jie Zhou¹, Jiwen Lu¹; ¹Tsinghua University, ²Beijing University of Posts and Telecommunications; {yinh23, xxw21}@mails.tsinghua.edu.cn, wuzhenyu@bupt.edu.cn, {jzhou, lujiwen}@tsinghua.edu.cn
Pseudocode | No | The paper does not contain any blocks explicitly labeled as "Pseudocode" or "Algorithm".
Open Source Code | Yes | We have uploaded our code on the MP3D dataset in the supplementary materials. We will release all code upon acceptance.
Open Datasets | Yes | We evaluate our method on three datasets: Matterport3D (MP3D) [5], HM3D [31] and RoboTHOR [9].
Dataset Splits | Yes | We test our SG-Nav on the MP3D validation set, which contains 11 indoor scenes, 21 goal object categories and 2195 object-goal navigation episodes. HM3D is used in the Habitat 2022 ObjectNav challenge and contains 2000 validation episodes on 20 validation environments with 6 goal object categories. RoboTHOR is used in the RoboTHOR 2020 and 2021 ObjectNav challenges and contains 1800 validation episodes on 15 validation environments with 12 goal object categories.
Hardware Specification | Yes | The evaluation of our model is performed on four RTX 3090 GPUs.
Software Dependencies | Yes | For the verification of short edges, we adopt LLaVA-1.6 (Mistral-7B) [24] as the VLM for discrimination. We choose LLaMA-7B [36] and GPT-4-0613 as the LLMs for our SG-Nav, and denote the resulting variants SG-Nav-LLaMA and SG-Nav-GPT.
Experiment Setup | Yes | We set 500 as the maximal number of navigation steps. The farthest and closest perceived distances of the agent are 10 m and 1.5 m. Each step of the agent covers 0.25 m, and each rotation is 30°. The camera of the agent is 0.90 m above the ground with a horizontal perspective, and it outputs 640 × 480 RGB-D images. We maintain an 800 × 800 2D occupancy map with a resolution of 0.05 m, which can represent a 40 m × 40 m large-scale scene. In Equation 4, the hyper-parameters N_max and S_thres are 10 and 0.8, respectively.
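
The Dataset Splits row above fully specifies the three validation splits, so they can be summarized in a small lookup structure. The sketch below is only an illustration; the `val_splits` name and field keys are ours, not identifiers from the released code.

```python
from typing import Dict

# Validation splits reported in the paper: scenes, goal categories, and episodes per benchmark.
val_splits: Dict[str, Dict[str, int]] = {
    "MP3D":     {"scenes": 11, "goal_categories": 21, "episodes": 2195},
    "HM3D":     {"scenes": 20, "goal_categories": 6,  "episodes": 2000},
    "RoboTHOR": {"scenes": 15, "goal_categories": 12, "episodes": 1800},
}
```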
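
The Experiment Setup row collects all agent, camera, and mapping hyper-parameters in prose; the following minimal Python sketch restates them as a single configuration object. It assumes a hypothetical `NavConfig` dataclass whose field names are illustrative and not taken from the authors' code.

```python
from dataclasses import dataclass


@dataclass
class NavConfig:
    """Hypothetical container for the evaluation settings reported in the paper."""
    max_steps: int = 500               # maximal navigation steps per episode
    max_perceive_dist_m: float = 10.0  # farthest perceived distance
    min_perceive_dist_m: float = 1.5   # closest perceived distance
    step_size_m: float = 0.25          # forward step length
    turn_angle_deg: float = 30.0       # rotation per turn action
    camera_height_m: float = 0.90      # camera height above the ground, horizontal perspective
    rgbd_width: int = 640              # RGB-D image width
    rgbd_height: int = 480             # RGB-D image height
    map_size_cells: int = 800          # 800 x 800 2D occupancy map
    map_resolution_m: float = 0.05     # metres per map cell
    n_max: int = 10                    # N_max in Equation 4
    s_thres: float = 0.8               # S_thres in Equation 4


cfg = NavConfig()
# 800 cells x 0.05 m/cell = 40 m, i.e. the 40 m x 40 m scene coverage stated in the paper.
assert abs(cfg.map_size_cells * cfg.map_resolution_m - 40.0) < 1e-9
```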