Learning Neural Acoustic Fields
Authors: Andrew Luo, Yilun Du, Michael Tarr, Josh Tenenbaum, Antonio Torralba, Chuang Gan
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we demonstrate that our model can faithfully represent the acoustic impulse response at seen and unseen locations. Additional ablation studies verify the importance of utilizing local geometric features to enable test-time generation fidelity. Next, we demonstrate that learning acoustic fields could facilitate improved visual representations when training images are sparse. Finally, we show that the learned NAF can be used to infer scene structure. |
| Researcher Affiliation | Collaboration | Andrew Luo (Carnegie Mellon University); Yilun Du (Massachusetts Institute of Technology); Michael J. Tarr (Carnegie Mellon University); Joshua B. Tenenbaum (MIT BCS, CBMM, CSAIL); Antonio Torralba (Massachusetts Institute of Technology); Chuang Gan (UMass Amherst and MIT-IBM Watson AI Lab) |
| Pseudocode | No | The paper describes algorithms and methods using text and diagrams (e.g., Figure 2) but does not include any formally labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Project site: https://www.andrew.cmu.edu/user/afluo/Neural_Acoustic_Fields |
| Open Datasets | Yes | For evaluating the learned acoustic fields, we use two different datasets: Soundspaces. Soundspaces [Chen et al., 2020, Straub et al., 2019] is a synthetic dataset generated via ray-tracing... Mesh RIR. The Mesh RIR dataset [Koyama et al., 2021] is recorded from a real scene... |
| Dataset Splits | No | The paper states: 'For each scene, we holdout 10% of the RIRs randomly as a test set.' This specifies the test split percentage, but does not explicitly detail the training or validation splits, or how the remaining data is partitioned or used for validation, which is needed for full reproducibility of all splits. |
| Hardware Specification | Yes | Each scene is trained for 200 epochs, which takes around 6 hours for the largest scenes on four Nvidia V100s. |
| Software Dependencies | No | The paper does not provide specific software names with version numbers for ancillary software components (e.g., deep learning frameworks like PyTorch or TensorFlow, along with their versions). |
| Experiment Setup | Yes | In each batch, we sample 20 impulse responses, and randomly select 2,000 frequency & time pairs within each spectrogram. An initial learning rate of 5e-4 is used for the network and the grid features. We add a small amount of noise sampled from N(0, 0.1) to each coordinate during training to prevent degenerate solutions. We utilize 64 coarse samples and 128 fine samples for each ray, and sample 1024 rays per batch. |
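
The experiment-setup numbers above translate directly into a batch-construction routine. Below is a minimal Python sketch of that step, assuming PyTorch (the paper does not state its framework) and 2-D positions; `make_batch`, `naf_model`, and `grid_features` are hypothetical names, not the authors' code. The sketch covers only the NAF training batches; the 64 coarse / 128 fine samples per ray and 1,024 rays per batch refer to the separate NeRF-style visual-representation experiments.

```python
import torch

# Hyperparameters quoted in the "Experiment Setup" row above.
RIRS_PER_BATCH = 20           # impulse responses sampled per batch
PAIRS_PER_SPECTROGRAM = 2000  # (frequency, time) query pairs per spectrogram
LEARNING_RATE = 5e-4          # shared by the network and the grid features
COORD_NOISE_STD = 0.1         # std of the N(0, 0.1) noise added to each coordinate

def make_batch(spectrograms, emitter_xy, listener_xy):
    """Assemble one NAF training batch (hypothetical helper, not the authors' code).

    spectrograms: (N, F, T) tensor of impulse-response spectrograms
    emitter_xy, listener_xy: (N, 2) emitter / listener positions for each RIR
    """
    n, f, t = spectrograms.shape
    rir_idx = torch.randint(n, (RIRS_PER_BATCH,))

    # Randomly pick 2,000 (frequency, time) pairs inside each selected spectrogram.
    freq_idx = torch.randint(f, (RIRS_PER_BATCH, PAIRS_PER_SPECTROGRAM))
    time_idx = torch.randint(t, (RIRS_PER_BATCH, PAIRS_PER_SPECTROGRAM))
    targets = spectrograms[rir_idx.unsqueeze(1), freq_idx, time_idx]

    # Perturb spatial coordinates with Gaussian noise to discourage degenerate solutions.
    emitters = emitter_xy[rir_idx] + COORD_NOISE_STD * torch.randn(RIRS_PER_BATCH, 2)
    listeners = listener_xy[rir_idx] + COORD_NOISE_STD * torch.randn(RIRS_PER_BATCH, 2)

    return emitters, listeners, freq_idx, time_idx, targets

# The same initial learning rate is applied to the network and the grid features, e.g.:
# optimizer = torch.optim.Adam(
#     list(naf_model.parameters()) + [grid_features], lr=LEARNING_RATE)
```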