Self-Supervised Visual Acoustic Matching

Authors: Arjun Somayazulu, Changan Chen, Kristen Grauman

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our proposed LeMARA model outperforms existing approaches [3, 7, 14] on challenging in-the-wild audio and environments from multiple datasets. Further, to benchmark this task, we introduce a high audio-visual correspondence subset of the AVSpeech [10] video dataset.
Researcher Affiliation | Collaboration | Arjun Somayazulu (UT Austin), Changan Chen (UT Austin), Kristen Grauman (UT Austin; FAIR, Meta)
Pseudocode | Yes | Supplementary (...) Pseudocode for a discriminator training epoch detailing our reverberator update mechanism (Algorithm 1). (A generic, hedged sketch of such an epoch follows the table.)
Open Source Code | No | Project page: https://vision.cs.utexas.edu/projects/ss_vam; We plan to release our code to facilitate further research.
Open Datasets | Yes | We use two datasets: SoundSpaces-Speech [7] and AVSpeech [10].
Dataset Splits | Yes | SoundSpaces-Speech consists of anechoic speech samples from LibriSpeech paired with their acoustically-correct reverberated waveform (rendered using SoundSpaces) in any of 82 unique environments, together with an RGB-D image at the listener's position. (...) We use train/val/test splits of 28,853/280/1,489 samples. (...) Our final set consists of 72,615/1,911/1,911 train/val/test samples.
Hardware Specification | Yes | Compute: All models are trained on 8 NVIDIA Quadro RTX 6000 GPUs.
Software Dependencies | No | The paper mentions software components such as the SpeechBrain MetricGAN-U implementation [34] and a WaveNet-like architecture, but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | We train LeMARA using the combined acoustic residue metric with α = 0.7. (...) We train the reverberators with batch size 4 and a learning rate of 1e-2 in stage (2). During stage (3) fine-tuning, we use batch size 2 and a learning rate of 1e-6. (...) In stage (1) (...) we train with batch size 32. During stage (3) fine-tuning, we use a batch size of 2. G and D are trained with learning rates of 2e-6 and 5e-4 respectively in both stages. (...) We clip each audio sample to 2.56s during training and evaluation. (The reported values are gathered into a hedged configuration sketch after the table.)
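
The hyperparameters quoted in the experiment-setup row describe a three-stage schedule. The sketch below only consolidates those reported numbers into one configuration object and works out the corresponding clip length in samples; the stage names, dictionary keys, and the 16 kHz sample rate are assumptions made for illustration and are not taken from the paper or its (unreleased) code.

# Hedged sketch (Python): the reported hyperparameters gathered into one place.
# Stage names, dictionary keys, and the 16 kHz sample rate are assumptions for
# illustration only; the numeric values are the ones quoted above.

SAMPLE_RATE_HZ = 16_000                               # assumed; the paper only states a 2.56 s clip
CLIP_SECONDS = 2.56
CLIP_SAMPLES = round(CLIP_SECONDS * SAMPLE_RATE_HZ)   # 40,960 samples at the assumed 16 kHz

TRAIN_CONFIG = {
    "acoustic_residue_alpha": 0.7,            # weight in the combined acoustic residue metric
    "stage1_pretraining": {"batch_size": 32},
    "stage2_reverberators": {"batch_size": 4, "lr": 1e-2},
    "stage3_finetuning": {"batch_size": 2, "lr_reverberators": 1e-6},
    "lr_G": 2e-6,                             # generator learning rate, used in both GAN stages
    "lr_D": 5e-4,                             # discriminator learning rate, used in both GAN stages
}

def clip_waveform(waveform, num_samples=CLIP_SAMPLES):
    """Truncate (or zero-pad) a 1-D waveform to the fixed 2.56 s training length."""
    if len(waveform) >= num_samples:
        return waveform[:num_samples]
    return list(waveform) + [0.0] * (num_samples - len(waveform))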
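
The pseudocode row points to Algorithm 1 in the paper's supplementary material, which is not reproduced in this summary. Purely as orientation, the sketch below shows a generic MetricGAN-style discriminator training epoch of the kind the quoted description and the SpeechBrain MetricGAN-U reference suggest: the discriminator is trained to regress a non-intrusive quality metric on generated audio. All names (D, generator, metric_fn, loader, opt_D) are hypothetical, and the paper's actual reverberator update mechanism is not modeled here.

# Hedged sketch (Python/PyTorch): a generic MetricGAN-style discriminator epoch.
# This is NOT the paper's Algorithm 1; it only illustrates the general pattern of
# training a discriminator to predict an audio-quality metric for generated samples.
import torch
import torch.nn as nn

def discriminator_epoch(D, generator, metric_fn, loader, opt_D):
    """One discriminator training epoch.

    D         -- network mapping a batch of waveforms to one scalar score each (assumed)
    generator -- frozen audio generator conditioned on (source_audio, image) (assumed)
    metric_fn -- non-intrusive quality metric returning a score tensor (assumed)
    loader    -- yields (source_audio, image) batches (assumed)
    opt_D     -- optimizer over D's parameters
    """
    mse = nn.MSELoss()
    D.train()
    for source_audio, image in loader:
        with torch.no_grad():
            fake = generator(source_audio, image)   # candidate acoustically matched audio
            target = metric_fn(fake)                # metric score the discriminator regresses
        pred = D(fake)
        loss = mse(pred, target)
        opt_D.zero_grad()
        loss.backward()
        opt_D.step()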