AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments
Authors: Sudipta Paul, Amit K. Roy-Chowdhury, Anoop Cherian
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To empirically evaluate AVLEN, we present experiments on the SoundSpaces framework for semantic audio-visual navigation tasks. Our results show that equipping the agent to ask for help leads to a clear improvement in performance, especially in challenging cases, e.g., when the sound is unheard during training or in the presence of distractor sounds. |
| Researcher Affiliation | Collaboration | Sudipta Paul¹ (spaul007@ucr.edu), Amit K. Roy-Chowdhury¹ (amitrc@ece.ucr.edu), Anoop Cherian² (cherian@merl.com); ¹University of California, Riverside; ²Mitsubishi Electric Research Labs, Cambridge, MA |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper's main text does not include an explicit statement about releasing code or a direct link to a code repository. While the NeurIPS checklist indicates instructions are in supplementary material, this is not within the body of the paper itself. |
| Open Datasets | Yes | To benchmark our experiments, we use the semantic audio-visual navigation dataset from Chen et al. [5] built over SoundSpaces. This dataset consists of sounds from 21 semantic categories of objects that are visually present in the Matterport3D scans... There are 0.5M/500/1000 episodes available in this dataset for train/val/test splits respectively from 85 Matterport3D scans. |
| Dataset Splits | Yes | There are 0.5M/500/1000 episodes available in this dataset for train/val/test splits respectively from 85 Matterport3D scans. |
| Hardware Specification | No | The paper states "Refer to the Appendix for more details" under Implementation Details and the NeurIPS checklist mentions "Provided in supplementary material" for compute resources, but the main text does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running experiments. |
| Software Dependencies | No | The paper mentions "Training uses ADAM [18] with learning rate 2.5 × 10⁻⁴." but does not specify version numbers for any programming languages, libraries, or other software dependencies. It also states "Refer to the Appendix for more details". |
| Experiment Setup | Yes | The memory size for πg and πq is 150 and for πℓ is 3. All the experiments consider a maximum of K = 3 allowed queries (unless otherwise specified). For each query, the agent takes ν = 3 navigation steps in the environment using the natural language instruction. We use a vocabulary with 1621 words. Training uses ADAM [18] with learning rate 2.5 × 10⁻⁴. For the πg policy, we assign a reward of +1 for reducing the geodesic distance towards the goal, a +10 reward for completing an episode successfully, i.e., calling the stop action near the Audio Goal, and a penalty of -0.01 per time step to encourage efficiency. As for the πℓ policy, we set a negative reward, denoted ζq, each time the agent queries the oracle, as well as a penalty, denoted ζf, when the query is made within τ steps of the previous query, where ζq is set to -1.2 and ζf to -0.5. |
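
The Dataset Splits and Experiment Setup rows above list several concrete values that are easy to misread in flattened extraction text. The following is a minimal, hypothetical Python configuration sketch that consolidates them in one place; the class and field names (e.g., `AVLENConfig`, `memory_size_goal`) are illustrative assumptions and do not come from the authors' code.

```python
from dataclasses import dataclass, field

# Hypothetical configuration sketch consolidating values reported in the table
# above; field names are illustrative, not taken from the authors' code.
@dataclass
class AVLENConfig:
    # Dataset: semantic audio-visual navigation benchmark (Chen et al. [5])
    # built over SoundSpaces / Matterport3D.
    num_sound_categories: int = 21
    num_scans: int = 85
    episodes: dict = field(default_factory=lambda: {
        "train": 500_000, "val": 500, "test": 1_000})

    # Policy memory sizes: 150 for the goal (pi_g) and query (pi_q) policies,
    # 3 for the language-based policy (pi_l).
    memory_size_goal: int = 150
    memory_size_query: int = 150
    memory_size_language: int = 3

    # Querying: at most K = 3 queries per episode; each query yields a
    # language instruction followed for nu = 3 navigation steps.
    max_queries_K: int = 3
    steps_per_instruction_nu: int = 3

    vocabulary_size: int = 1621
    optimizer: str = "Adam"
    learning_rate: float = 2.5e-4
```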
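
The reward shaping described in the Experiment Setup row can also be read as a small step-reward function. The sketch below is only an illustration of that description under stated assumptions: the function name, argument names, and the exact success/query conditions are hypothetical, not the authors' implementation.

```python
# Hypothetical step-reward sketch for the goal policy (pi_g) plus the query
# penalties described above; the signature and condition flags are assumptions.
def step_reward(
    reduced_geodesic_distance: bool,  # did this step shrink the geodesic distance to the goal?
    episode_success: bool,            # stop action called near the Audio Goal
    queried_oracle: bool,             # did the agent query the oracle this step?
    query_within_tau: bool,           # was the query within tau steps of the previous one?
) -> float:
    reward = -0.01                    # per-step penalty to encourage efficiency
    if reduced_geodesic_distance:
        reward += 1.0                 # +1 for reducing geodesic distance to the goal
    if episode_success:
        reward += 10.0                # +10 for completing an episode successfully
    if queried_oracle:
        reward += -1.2                # query penalty (zeta_q)
        if query_within_tau:
            reward += -0.5            # extra penalty (zeta_f) for querying too frequently
    return reward
```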