Few-Shot Audio-Visual Learning of Environment Acoustics
Authors: Sagnik Majumder, Changan Chen, Ziad Al-Halah, Kristen Grauman
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments using a state-of-the-art audio-visual simulator for 3D environments, we demonstrate that our method successfully generates arbitrary RIRs, outperforming state-of-the-art methods and, in a major departure from traditional methods, generalizing to novel environments in a few-shot manner. We evaluate Few-ShotRIR with realistic audio-visual simulations from SoundSpaces [10] comprising 83 real-world Matterport3D [7] environment scans. Our model successfully learns environment acoustics, outperforming the state-of-the-art models in addition to several baselines. We also demonstrate the impact on two downstream tasks that rely on the spatialization accuracy of RIRs: sound source localization and depth prediction. |
| Researcher Affiliation | Collaboration | Sagnik Majumder¹, Changan Chen¹,², Ziad Al-Halah¹, Kristen Grauman¹,² (¹UT Austin, ²Facebook AI Research) |
| Pseudocode | No | The paper describes the model architecture and components, but it does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | Implementation and training details are provided in Supp, and code will be published. ... We will publish our code. |
| Open Datasets | Yes | We evaluate our task using a state-of-the-art perceptually realistic 3D audio-visual simulator. In particular, we use the AI-Habitat simulator [54] with the SoundSpaces [10] audio and the Matterport3D scenes [7]. ... Matterport3D [7] license available at http://kaldir.vc.in.tum.de/matterport/MP_TOS.pdf. |
| Dataset Splits | Yes | For the seen environments, we hold out a subset of queries Q for testing and use the rest for training and validation. Our test set consists of 14 sets of 50 arbitrary queries for each environment, where our model uses the same randomly chosen observation set for all queries in a set. This results in a train-val split with 8,107,904 queries, and a test split with 39,900 queries for seen environments and 18,200 queries for unseen ones. (See the first sketch after this table for the arithmetic behind these counts.) |
| Hardware Specification | Yes | when both models are trained on 8 NVIDIA Quadro RTX 6000 GPUs. |
| Software Dependencies | No | The paper mentions using 'Adam [30]' as an optimizer, but it does not specify version numbers for the software libraries or frameworks it uses (e.g., Python, PyTorch, or TensorFlow versions), which are necessary for full reproducibility. |
| Experiment Setup | Yes | Our final training objective is L = L1 + λL_D, where λ is the weight for L_D. We train our model using Adam [30] with a learning rate of 10⁻⁴ and λ = 10⁻². ... We render all RGB-D images for our model input at a resolution of 128 × 128 and sample binaural RIRs at a rate of 16 kHz. To generate the RIR spectrograms, we compute the STFT with a Hann window of 15.5 ms, hop length of 3.875 ms, and FFT size of 511. ... we use egocentric observation sets of size N = 20 samples for our model. (See the second sketch after this table for these settings in code.) |
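The counts in the Dataset Splits row are internally consistent: 14 sets × 50 queries give 700 test queries per environment, so 39,900 seen queries imply 57 seen environments and 18,200 unseen queries imply 26 unseen ones, which together account for the 83 Matterport3D scans cited above. The Python sketch below illustrates that bookkeeping under stated assumptions: the pose representation, the `random_pose` helper, and the dictionary layout are hypothetical placeholders; only the counts (14 sets, 50 queries, and N = 20 observations from the Experiment Setup row) come from the quoted text.

```python
import random

# Hypothetical sketch of the test-set bookkeeping quoted in the Dataset Splits
# row. Only the counts come from the paper: 14 sets x 50 queries per environment,
# each set sharing one observation set of N = 20 egocentric samples.
SETS_PER_ENV = 14
QUERIES_PER_SET = 50
N_OBSERVATIONS = 20

def random_pose():
    # Placeholder: an (x, y, heading) pose sampled uniformly from a scene.
    return (random.random(), random.random(), random.choice([0, 90, 180, 270]))

def build_test_queries(environments):
    test_set = []
    for env in environments:
        for _ in range(SETS_PER_ENV):
            # One randomly chosen observation set, reused for all 50 queries.
            observations = [random_pose() for _ in range(N_OBSERVATIONS)]
            queries = [(random_pose(), random_pose())  # (source, receiver)
                       for _ in range(QUERIES_PER_SET)]
            test_set.append({"env": env,
                             "observations": observations,
                             "queries": queries})
    return test_set

# 57 seen environments -> 57 * 14 * 50 = 39,900 queries;
# 26 unseen environments -> 26 * 14 * 50 = 18,200 queries.
assert sum(len(s["queries"]) for s in build_test_queries(range(57))) == 39_900
```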
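The Experiment Setup row fixes the audio front end and the optimizer precisely enough to sketch in code. The snippet below is a minimal illustration, not the authors' released implementation: the STFT parameters (16 kHz sample rate, 15.5 ms Hann window, 3.875 ms hop, FFT size 511), Adam with learning rate 10⁻⁴, and λ = 10⁻² come from the quote, while the placeholder network, the reading of L1 as an L1 reconstruction loss on spectrograms, and the second loss term L_D (whose definition the quote does not give) are stand-in assumptions.

```python
import torch
import torch.nn.functional as F

SAMPLE_RATE = 16_000                      # binaural RIRs sampled at 16 kHz
N_FFT = 511                               # FFT size from the paper
WIN_LENGTH = int(0.0155 * SAMPLE_RATE)    # 15.5 ms Hann window -> 248 samples
HOP_LENGTH = int(0.003875 * SAMPLE_RATE)  # 3.875 ms hop -> 62 samples

def rir_spectrogram(waveform: torch.Tensor) -> torch.Tensor:
    """Magnitude STFT of a binaural RIR; waveform has shape (2, num_samples)."""
    return torch.stft(
        waveform,
        n_fft=N_FFT,
        win_length=WIN_LENGTH,
        hop_length=HOP_LENGTH,
        window=torch.hann_window(WIN_LENGTH),
        return_complex=True,
    ).abs()

# Objective L = L1 + lambda * L_D with lambda = 1e-2, optimized by Adam at 1e-4.
# The network here is a stand-in; the paper's architecture is not shown above.
model = torch.nn.Conv2d(2, 2, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
LAMBDA = 1e-2

def training_step(input_spec, target_spec, l_d):
    """One update; l_d stands in for the unspecified second loss term L_D."""
    loss = F.l1_loss(model(input_spec), target_spec) + LAMBDA * l_d
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that the 15.5 ms window (248 samples) is shorter than the 511-point FFT, so it is zero-padded, yielding 511 // 2 + 1 = 256 frequency bins per spectrogram frame.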