MO-DDN: A Coarse-to-Fine Attribute-based Exploration Agent for Multi-Object Demand-driven Navigation

Authors: Hongcheng Wang, Peiqi Liu, Wenzhe Cai, Mingdong Wu, Zhengyu Qian, Hao Dong

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results illustrate that this coarse-to-fine exploration strategy capitalizes on the advantages of attributes at various decision-making levels, resulting in superior performance compared to baseline methods. Code and video can be found at https://sites.google.com/view/moddn. Section headings: 5 Experiment; 5.1 Experimental Settings; 5.2 Baselines; 5.3 Baseline Comparison; 5.4 Ablation Study.
Researcher Affiliation | Academia | Hongcheng Wang (1,3), Peiqi Liu (2), Wenzhe Cai (4), Mingdong Wu (1,3), Zhengyu Qian (2), Hao Dong (1,3); 1 CFCS, School of CS, PKU; 2 School of EECS, PKU; 3 PKU-Agibot Lab; 4 School of Automation, Southeast University
Pseudocode | Yes | Algorithm 1: Losses in Attribute Training
Open Source Code | Yes | Code and video can be found at https://sites.google.com/view/moddn.
Open Datasets | Yes | We generate 300 tasks, encompassing 358 object categories from the HSSD dataset [99].
Dataset Splits | Yes | HSSD splits the scenes into val scenes and train scenes (i.e., unseen scenes and seen scenes in Tab. 5.3, respectively).
Hardware Specification | Yes | A single RTX 4090 is enough to run the experiments. Our method and baselines can be trained on a single RTX 4090, which will take about one day for each method.
Software Dependencies | Yes | We use the standard transformer encoder from the official PyTorch 1.13.1 implementation.
Experiment Setup | Yes | We use the standard transformer encoder from the official PyTorch 1.13.1 implementation, where d_model is 768, nhead is 8, num_layers is 6, and other parameters remain default. The embedding dim of action is 64. The embedding dim of GPS+Compass is 32. The input dim of the LSTM is 768+64+32, its hidden_size is 1024, and its num_layers is 2. The depth model is a simple five-layer CNN followed by a two-layer MLP. Loss = λ1 · Attribute Loss + λ2 · Matching Loss + λ3 · VQ Loss + λ4 · Commit Loss + λ5 · Recon Loss, where λ1 is 2.0, λ2 is 1.0, λ3 is 1.0, λ4 is 0.25, and λ5 is 1.0. We trained the model on a single RTX 4090 using imitation learning and cross-entropy loss, i.e., considering action prediction as a classification task, consuming about 12 h.
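
To make the reported setup concrete, the following is a minimal PyTorch sketch (not the authors' released code) that wires together the components quoted above: a standard transformer encoder with d_model 768, nhead 8, and num_layers 6; a 64-dim action embedding; a 32-dim GPS+Compass embedding; and a 2-layer LSTM with input size 768+64+32 and hidden size 1024, followed by a classification head for the action (matching the cross-entropy imitation objective). The class name, the raw GPS+Compass input dimension, and the number of actions are illustrative assumptions, and the depth CNN/MLP branch is omitted for brevity.

```python
import torch
import torch.nn as nn


class PolicySketch(nn.Module):
    """Illustrative policy skeleton built from the hyperparameters quoted in the report."""

    def __init__(self, num_actions: int = 6):
        super().__init__()
        # Standard PyTorch transformer encoder: d_model=768, nhead=8, num_layers=6,
        # other parameters left at their defaults (as stated in the paper).
        encoder_layer = nn.TransformerEncoderLayer(d_model=768, nhead=8)
        self.obs_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
        # 64-dim embedding of the previous action (num_actions is an assumption).
        self.action_embed = nn.Embedding(num_actions + 1, 64)
        # 32-dim embedding of GPS+Compass; a 3-dim raw input (x, y, heading) is an assumption.
        self.gps_compass_embed = nn.Linear(3, 32)
        # Recurrent state encoder: input 768 + 64 + 32, hidden_size 1024, num_layers 2.
        self.lstm = nn.LSTM(input_size=768 + 64 + 32, hidden_size=1024, num_layers=2)
        # Action prediction treated as a classification task (cross-entropy loss).
        self.action_head = nn.Linear(1024, num_actions)

    def forward(self, obs_tokens, prev_action, gps_compass, hidden=None):
        # obs_tokens: (seq_len, batch, 768) token sequence of observation/attribute features.
        feat = self.obs_encoder(obs_tokens).mean(dim=0)       # (batch, 768)
        a = self.action_embed(prev_action)                    # (batch, 64)
        g = self.gps_compass_embed(gps_compass)               # (batch, 32)
        x = torch.cat([feat, a, g], dim=-1).unsqueeze(0)      # (1, batch, 864)
        out, hidden = self.lstm(x, hidden)
        logits = self.action_head(out.squeeze(0))             # (batch, num_actions)
        return logits, hidden


# Example forward pass with dummy inputs (shapes are illustrative only).
policy = PolicySketch()
logits, state = policy(torch.randn(16, 2, 768),
                       torch.zeros(2, dtype=torch.long),
                       torch.zeros(2, 3))
```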
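
The quoted loss combination for attribute training can likewise be sketched as a weighted sum with the reported coefficients (λ1=2.0, λ2=1.0, λ3=1.0, λ4=0.25, λ5=1.0). The individual loss tensors are placeholders here; how each term is computed (e.g., the VQ and commitment losses of the codebook) is specific to the paper and not reproduced in this sketch.

```python
def total_loss(attribute_loss, matching_loss, vq_loss, commit_loss, recon_loss,
               l1=2.0, l2=1.0, l3=1.0, l4=0.25, l5=1.0):
    """Weighted sum of the five attribute-training losses with the reported weights."""
    return (l1 * attribute_loss + l2 * matching_loss + l3 * vq_loss
            + l4 * commit_loss + l5 * recon_loss)
```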