Few-Shot Audio-Visual Learning of Environment Acoustics
Authors: Sagnik Majumder, Changan Chen, Ziad Al-Halah, Kristen Grauman
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments using a state-of-the-art audio-visual simulator for 3D environments, we demonstrate that our method successfully generates arbitrary RIRs, outperforming state-of-the-art methods and, in a major departure from traditional methods, generalizing to novel environments in a few-shot manner. We evaluate Few-ShotRIR with realistic audio-visual simulations from SoundSpaces [10] comprising 83 real-world Matterport3D [7] environment scans. Our model successfully learns environment acoustics, outperforming the state-of-the-art models in addition to several baselines. We also demonstrate the impact on two downstream tasks that rely on the spatialization accuracy of RIRs: sound source localization and depth prediction. |
| Researcher Affiliation | Collaboration | Sagnik Majumder¹, Changan Chen¹,², Ziad Al-Halah¹, Kristen Grauman¹,² (¹UT Austin, ²Facebook AI Research) |
| Pseudocode | No | The paper describes the model architecture and components, but it does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | Implementation and training details are provided in Supp, and code will be published. ... We will publish our code. |
| Open Datasets | Yes | We evaluate our task using a state-of-the-art perceptually realistic 3D audio-visual simulator. In particular, we use the AI-Habitat simulator [54] with the SoundSpaces [10] audio and the Matterport3D scenes [7]. ... Matterport3D [7] license available at http://kaldir.vc.in.tum.de/matterport/MP_TOS.pdf. |
| Dataset Splits | Yes | For the seen environments, we hold out a subset of queries Q for testing and use the rest for training and validation. Our test set consists of 14 sets of 50 arbitrary queries for each environment, where our model uses the same randomly chosen observation set for all queries in a set. This results in a train-val split with 8,107,904 queries, and a test split with 39,900 queries for seen environments and 18,200 queries for unseen ones. (See the first sketch after this table for the arithmetic behind these counts.) |
| Hardware Specification | Yes | when both models are trained on 8 NVIDIA Quadro RTX 6000 GPUs. |
| Software Dependencies | No | The paper mentions using 'Adam [30]' as an optimizer, but it does not specify version numbers for the software libraries or frameworks it uses (e.g., Python, PyTorch, or TensorFlow versions), which are necessary for full reproducibility. |
| Experiment Setup | Yes | Our final training objective is L = L1 + λL_D, where λ is the weight for L_D. We train our model using Adam [30] with a learning rate of 10⁻⁴ and λ = 10⁻². ... We render all RGB-D images for our model input at a resolution of 128 × 128 and sample binaural RIRs at a rate of 16 kHz. To generate the RIR spectrograms, we compute the STFT with a Hann window of 15.5 ms, hop length of 3.875 ms, and FFT size of 511. ... we use egocentric observation sets of size N = 20 samples for our model. (See the second sketch after this table for these settings in code.) |
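The counts in the Dataset Splits row are internally consistent: 14 sets × 50 queries give 700 test queries per environment, so 39,900 seen queries imply 57 seen environments and 18,200 unseen queries imply 26 unseen ones, which together account for the 83 Matterport3D scans cited above. The Python sketch below illustrates that bookkeeping under stated assumptions: the pose representation, the `random_pose` helper, and the dictionary layout are hypothetical placeholders; only the counts (14 sets, 50 queries, and N = 20 observations from the Experiment Setup row) come from the quoted text.

```python
import random

# Hypothetical sketch of the test-set bookkeeping quoted in the Dataset Splits
# row. Only the counts come from the paper: 14 sets x 50 queries per environment,
# each set sharing one observation set of N = 20 egocentric samples.
SETS_PER_ENV = 14
QUERIES_PER_SET = 50
N_OBSERVATIONS = 20

def random_pose():
    # Placeholder: an (x, y, heading) pose sampled uniformly from a scene.
    return (random.random(), random.random(), random.choice([0, 90, 180, 270]))

def build_test_queries(environments):
    test_set = []
    for env in environments:
        for _ in range(SETS_PER_ENV):
            # One randomly chosen observation set, reused for all 50 queries.
            observations = [random_pose() for _ in range(N_OBSERVATIONS)]
            queries = [(random_pose(), random_pose())  # (source, receiver)
                       for _ in range(QUERIES_PER_SET)]
            test_set.append({"env": env,
                             "observations": observations,
                             "queries": queries})
    return test_set

# 57 seen environments -> 57 * 14 * 50 = 39,900 queries;
# 26 unseen environments -> 26 * 14 * 50 = 18,200 queries.
assert sum(len(s["queries"]) for s in build_test_queries(range(57))) == 39_900
```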
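The Experiment Setup row fixes the audio front end and the optimizer precisely enough to sketch in code. The snippet below is a minimal illustration, not the authors' released implementation: the STFT parameters (16 kHz sample rate, 15.5 ms Hann window, 3.875 ms hop, FFT size 511), Adam with learning rate 10⁻⁴, and λ = 10⁻² come from the quote, while the placeholder network, the reading of L1 as an L1 reconstruction loss on spectrograms, and the second loss term L_D (whose definition the quote does not give) are stand-in assumptions.

```python
import torch
import torch.nn.functional as F

SAMPLE_RATE = 16_000                      # binaural RIRs sampled at 16 kHz
N_FFT = 511                               # FFT size from the paper
WIN_LENGTH = int(0.0155 * SAMPLE_RATE)    # 15.5 ms Hann window -> 248 samples
HOP_LENGTH = int(0.003875 * SAMPLE_RATE)  # 3.875 ms hop -> 62 samples

def rir_spectrogram(waveform: torch.Tensor) -> torch.Tensor:
    """Magnitude STFT of a binaural RIR; waveform has shape (2, num_samples)."""
    return torch.stft(
        waveform,
        n_fft=N_FFT,
        win_length=WIN_LENGTH,
        hop_length=HOP_LENGTH,
        window=torch.hann_window(WIN_LENGTH),
        return_complex=True,
    ).abs()

# Objective L = L1 + lambda * L_D with lambda = 1e-2, optimized by Adam at 1e-4.
# The network here is a stand-in; the paper's architecture is not shown above.
model = torch.nn.Conv2d(2, 2, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
LAMBDA = 1e-2

def training_step(input_spec, target_spec, l_d):
    """One update; l_d stands in for the unspecified second loss term L_D."""
    loss = F.l1_loss(model(input_spec), target_spec) + LAMBDA * l_d
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that the 15.5 ms window (248 samples) is shorter than the 511-point FFT, so it is zero-padded, yielding 511 // 2 + 1 = 256 frequency bins per spectrogram frame.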