MO-DDN: A Coarse-to-Fine Attribute-based Exploration Agent for Multi-Object Demand-driven Navigation
Authors: Hongcheng Wang, Peiqi Liu, Wenzhe Cai, Mingdong Wu, Zhengyu Qian, Hao Dong
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results illustrate that this coarse-to-fine exploration strategy capitalizes on the advantages of attributes at various decision-making levels, resulting in superior performance compared to baseline methods. Code and video can be found at https://sites.google.com/view/moddn. Section 5: Experiment (5.1 Experimental Settings, 5.2 Baselines, 5.3 Baseline Comparison, 5.4 Ablation Study) |
| Researcher Affiliation | Academia | Hongcheng Wang1,3 Peiqi Liu2 Wenzhe Cai4 Mingdong Wu1,3 Zhengyu Qian2 Hao Dong1,3 1CFCS, School of CS, PKU 2School of EECS, PKU 3PKU-Agibot Lab 4School of Automation, Southeast University |
| Pseudocode | Yes | Algorithm 1: Losses in Attribute Training |
| Open Source Code | Yes | Code and video can be found at https://sites.google.com/view/moddn. |
| Open Datasets | Yes | We generate 300 tasks, encompassing 358 object categories from the HSSD dataset [99]. |
| Dataset Splits | Yes | HSSD splits the scenes into val scenes and train scenes (i.e., unseen scenes and seen scenes in Tab. 5.3, respectively). |
| Hardware Specification | Yes | A single RTX 4090 is enough to run the experiments. Our method and baselines can be trained on a single RTX 4090, which will take about one day for each method. |
| Software Dependencies | Yes | We use the standard transformer encoder from the official PyTorch 1.13.1 implementation |
| Experiment Setup | Yes | We use the standard transformer encoder from the official PyTorch 1.13.1 implementation, where d_model is 768, nhead is 8, num_layers is 6, and other parameters remain default. The embedding dim of action is 64. The embedding dim of GPS+Compass is 32. The input dim of LSTM is 768+64+32, its hidden_size is 1024, and its num_layers is 2. The depth model is a simple five-layer CNN model and a two-layer MLP model. Loss = λ1 Attribute Loss + λ2 Matching Loss + λ3 VQ Loss + λ4 Commit Loss + λ5 Recon Loss, where λ1 is 2.0, λ2 is 1.0, λ3 is 1.0, λ4 is 0.25, and λ5 is 1.0. We trained the model on a single RTX 4090 using imitation learning and cross-entropy loss, i.e., considering the action prediction as a classification task, consuming about 12h. |
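The Experiment Setup row fixes the policy architecture and loss weighting precisely enough to sketch in code. Below is a minimal PyTorch sketch of that configuration; the module names (`DepthEncoder`, `NavPolicy`, `total_loss`), the depth-CNN channel widths, the 1-channel depth input, the 3-dim GPS+Compass input, and the action-space size are illustrative assumptions, while the transformer/LSTM dimensions and the λ weights are taken from the quoted setup.

```python
# Minimal sketch of the policy/loss configuration described in the
# "Experiment Setup" row. Names and layer widths not stated in the paper
# excerpt are assumptions and marked as such below.
import torch
import torch.nn as nn


class DepthEncoder(nn.Module):
    """Five-layer CNN followed by a two-layer MLP (channel/width choices assumed)."""
    def __init__(self, out_dim: int = 768):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(128, 256, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(256, 256, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.mlp = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, out_dim))

    def forward(self, depth):  # depth: (B, 1, H, W), assumed single-channel
        return self.mlp(self.cnn(depth))


class NavPolicy(nn.Module):
    """Transformer encoder + LSTM policy with the dimensions quoted in the setup."""
    def __init__(self, num_actions: int = 6):  # action-space size assumed
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.action_emb = nn.Embedding(num_actions + 1, 64)   # 64-dim action embedding
        self.gps_compass = nn.Linear(3, 32)                    # 32-dim GPS+Compass embedding
        self.lstm = nn.LSTM(input_size=768 + 64 + 32, hidden_size=1024, num_layers=2)
        self.action_head = nn.Linear(1024, num_actions)

    def forward(self, tokens, prev_action, gps_compass, hidden=None):
        obs = self.encoder(tokens).mean(dim=1)                 # (B, 768) pooled observation
        x = torch.cat([obs, self.action_emb(prev_action),
                       self.gps_compass(gps_compass)], dim=-1)  # (B, 768+64+32)
        out, hidden = self.lstm(x.unsqueeze(0), hidden)         # one timestep per call
        return self.action_head(out.squeeze(0)), hidden


def total_loss(attr, match, vq, commit, recon):
    """Weighted sum with the lambda values quoted in the setup description."""
    return 2.0 * attr + 1.0 * match + 1.0 * vq + 0.25 * commit + 1.0 * recon
```

Per the quoted description, training is imitation learning: the `action_head` logits would be supervised with a cross-entropy loss against expert actions, treating action prediction as classification.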