ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings

Authors: Arjun Majumdar, Gunjan Aggarwal, Bhavika Devnani, Judy Hoffman, Dhruv Batra

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We extensively evaluate our agents on three ObjectNav datasets (Gibson, HM3D, and MP3D) and observe absolute improvements in success of 4.2% - 20.0% over existing zero-shot methods."
Researcher Affiliation | Academia | Arjun Majumdar, Gunjan Aggarwal, Bhavika Devnani, Judy Hoffman, Dhruv Batra (Georgia Institute of Technology)
Pseudocode | No | The paper describes algorithms and architectures but does not include any pseudocode or algorithm blocks.
Open Source Code | No | "Source code for reproducing our results will be publicly released."
Open Datasets | Yes | "We generate a dataset for training our SemanticNav agent using the 800 training environments from HM3D [20]."
Dataset Splits | No | The paper mentions evaluating on "validation scenes" from various datasets (e.g., ImageNav (Gibson) consists of 4,200 episodes from 14 Gibson [4] validation scenes) and selecting the best checkpoint based on "ObjectNav validation success rate (SR)", but it does not specify the train/validation split (e.g., percentages or counts) for its primary training data (7.2M episodes generated from HM3D).
Hardware Specification | Yes | "Each training run was conducted on a single compute node with 8 NVIDIA A40 GPUs."
Software Dependencies | No | "We train agents using PyTorch [35] and the Habitat simulator [2, 3]." No specific version numbers for PyTorch or the Habitat simulator are provided.
Experiment Setup | Yes | "During RL training, we use two data augmentation techniques: color jitter and random translation (adapted from [16])." Specifically, agents are trained with DD-PPO [32] using a reward function proposed for ImageNav by Al-Halah et al. [18]:

r_t = r_success + r_angle-success - Δ_dtg - Δ_atg + r_slack    (1)

where r_success = 5 if STOP is called when the agent is within 1 m of the goal position (and 0 otherwise); r_angle-success = 5 if STOP is called when the agent is within 1 m of the goal position and is pointing within 25° of the goal heading, i.e., the direction the camera was pointing when the goal image was collected (and 0 otherwise); Δ_dtg is the change in the agent's distance-to-goal, i.e., the geodesic distance to the goal position; Δ_atg is the change in the agent's angle-to-goal, i.e., the difference between the agent's heading and the goal heading, but is set to 0 if the agent is more than 1 m from the goal; and r_slack = -0.01 to encourage efficient navigation.
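The quoted reward can be sketched as a per-step function. This is a minimal illustration, not the authors' code: the function name and argument names are hypothetical, and the thresholds (1 m success radius, 25° heading tolerance, -0.01 slack) follow the setup described above.

```python
import math

def imagenav_reward(
    called_stop: bool,
    dist_to_goal: float,        # geodesic distance to goal, meters
    prev_dist_to_goal: float,   # same quantity at the previous step
    angle_to_goal: float,       # |agent heading - goal heading|, radians
    prev_angle_to_goal: float,  # same quantity at the previous step
    success_radius: float = 1.0,      # 1 m success threshold
    angle_success_deg: float = 25.0,  # 25 degree heading tolerance
) -> float:
    """Sketch of the ImageNav reward r_t from Al-Halah et al. (Eq. 1)."""
    within_radius = dist_to_goal <= success_radius

    # r_success: +5 if STOP is called within 1 m of the goal position
    r_success = 5.0 if called_stop and within_radius else 0.0

    # r_angle-success: +5 if additionally the agent points within 25 deg
    # of the goal heading when STOP is called
    r_angle_success = (
        5.0
        if called_stop
        and within_radius
        and math.degrees(angle_to_goal) <= angle_success_deg
        else 0.0
    )

    # -Δ_dtg: dense shaping, positive when distance-to-goal decreased
    delta_dtg = dist_to_goal - prev_dist_to_goal

    # -Δ_atg: angle shaping, active only within 1 m of the goal
    delta_atg = angle_to_goal - prev_angle_to_goal if within_radius else 0.0

    r_slack = -0.01  # per-step penalty encouraging efficient navigation
    return r_success + r_angle_success - delta_dtg - delta_atg + r_slack
```

For example, a non-terminal step that closes 1 m of geodesic distance far from the goal yields roughly +0.99 (the +1.0 shaping term minus the 0.01 slack penalty), while a successful STOP inside the radius collects both +5 bonuses.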