Learning to Set Waypoints for Audio-Visual Navigation
Authors: Changan Chen, Sagnik Majumder, Ziad Al-Halah, Ruohan Gao, Santhosh Kumar Ramakrishnan, Kristen Grauman
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate our approach on two challenging datasets of real-world 3D scenes, Replica and Matterport3D. Our model improves the state of the art by a substantial margin, and our experiments reveal that learning the links between sights, sounds, and space is essential for audio-visual navigation. Project: http://vision.cs.utexas.edu/projects/audio_visual_waypoints. Table 1 shows the results. We refer to our model as AV-WaN (Audio-Visual Waypoint Navigation). Random does poorly due to the challenging nature of the AudioGoal task and the complex 3D environments. For the heard sound, AV-WaN strongly outperforms all the other methods with 8.4% and 29% SPL gains on Replica compared to Chen et al. and Gan et al., and 17.2% and 49.5% gains on Matterport. |
| Researcher Affiliation | Collaboration | Changan Chen (1,2), Sagnik Majumder (1), Ziad Al-Halah (1), Ruohan Gao (1,2), Santhosh K. Ramakrishnan (1,2), Kristen Grauman (1,2); (1) UT Austin, (2) Facebook AI Research |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a project link (http://vision.cs.utexas.edu/projects/audio_visual_waypoints) but no explicit statement that the source code is hosted there or a direct link to a code repository. |
| Open Datasets | Yes | We use the AI-Habitat simulator (Savva et al., 2019) with the publicly available Replica (Straub et al., 2019) and Matterport3D (Chang et al., 2017) environments together with the public SoundSpaces audio simulations (Chen et al., 2020). |
| Dataset Splits | Yes | We follow the protocol of the SoundSpaces AudioGoal benchmark (Chen et al., 2020), with train/val/test splits of 9/4/5 scenes on Replica and 73/11/18 scenes on Matterport3D. (These counts are restated in the first sketch below the table.) |
| Hardware Specification | No | The paper does not specify any particular GPU, CPU models, or other detailed hardware specifications used for running experiments. |
| Software Dependencies | No | The paper mentions 'Python 3.8' but does not give version numbers for other key software dependencies used in the experiments (e.g., Habitat, SoundSpaces, or the deep learning framework implementing PPO, Adam, and the GRU). |
| Experiment Setup | Yes | We train our model with Adam (Kingma & Ba, 2014) with a learning rate of 2.5 × 10⁻⁴. The outputs of the three encoders g_t, b_t and a_t are all of dimension 512. We use a one-layer bidirectional GRU (Chung et al., 2015) with 512 hidden units that takes [g_t, b_t, a_t] as input. The geometric map size s_g is 200 at a resolution of 0.1 m. The acoustic map size s_a and the action map size s_w are 20 and 9 respectively, at the same resolution as the environment. We use an entropy loss on the policy distribution with coefficient 0.02. We train for 7.5 million policy prediction steps, and we set the upper limit of planning steps to 10. (These hyperparameters are restated in the configuration sketch below the table.) |
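
For reference, here is a minimal sketch that records the train/val/test scene counts quoted in the Dataset Splits row. The dictionary layout and names are assumptions made for illustration; they are not taken from the authors' code.

```python
# Scene counts per split for the SoundSpaces AudioGoal benchmark,
# as quoted in the Dataset Splits row above.
SCENE_SPLITS = {
    "replica":      {"train": 9,  "val": 4,  "test": 5},
    "matterport3d": {"train": 73, "val": 11, "test": 18},
}

for dataset, splits in SCENE_SPLITS.items():
    total = sum(splits.values())
    print(f"{dataset}: {splits} ({total} scenes used in total)")
```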
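
The Experiment Setup row can be made concrete with a short PyTorch sketch. Only the quoted hyperparameters (512-d encoder features, a one-layer bidirectional GRU with 512 hidden units, a 9×9 action map, Adam with learning rate 2.5 × 10⁻⁴, entropy coefficient 0.02) come from the paper; the encoder internals, the placeholder input sizes `G_IN`/`B_IN`/`A_IN`, the class and variable names, and the actor-critic heads are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

FEAT_DIM = 512        # each encoder output g_t, b_t, a_t is 512-d (quoted)
ACTION_MAP_SIZE = 9   # action map s_w is 9x9 (quoted)
G_IN, B_IN, A_IN = 1024, 1024, 1024  # assumed flattened encoder input sizes (placeholders)

class WaypointPolicy(nn.Module):
    """Sketch of a waypoint-predicting policy using the quoted hyperparameters."""
    def __init__(self):
        super().__init__()
        # Placeholder encoders; the paper's encoders operate on the geometric map,
        # binaural audio, and acoustic map respectively.
        self.enc_g = nn.Linear(G_IN, FEAT_DIM)
        self.enc_b = nn.Linear(B_IN, FEAT_DIM)
        self.enc_a = nn.Linear(A_IN, FEAT_DIM)
        # One-layer bidirectional GRU with 512 hidden units over [g_t, b_t, a_t].
        self.gru = nn.GRU(3 * FEAT_DIM, FEAT_DIM, num_layers=1,
                          bidirectional=True, batch_first=True)
        # Assumed actor-critic heads: logits over the 9x9 waypoint map, plus a value.
        self.actor = nn.Linear(2 * FEAT_DIM, ACTION_MAP_SIZE ** 2)
        self.critic = nn.Linear(2 * FEAT_DIM, 1)

    def forward(self, g_t, b_t, a_t, hidden=None):
        feats = torch.cat([self.enc_g(g_t), self.enc_b(b_t), self.enc_a(a_t)], dim=-1)
        out, hidden = self.gru(feats.unsqueeze(1), hidden)  # one step per policy call
        out = out.squeeze(1)
        return self.actor(out), self.critic(out), hidden

policy = WaypointPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=2.5e-4)  # quoted learning rate
ENTROPY_COEF = 0.02  # quoted entropy-loss coefficient on the policy distribution

# Example forward pass with a batch of 4 dummy observations.
g = torch.randn(4, G_IN); b = torch.randn(4, B_IN); a = torch.randn(4, A_IN)
logits, value, h = policy(g, b, a)
waypoint_dist = torch.distributions.Categorical(logits=logits)  # over the 9x9 grid
```

In an actual training run the recurrent state `h` would be carried across environment steps, and the entropy term weighted by `ENTROPY_COEF` would enter the RL objective over the quoted 7.5 million policy prediction steps; those details are omitted here.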