Self-Supervised Visual Acoustic Matching
Authors: Arjun Somayazulu, Changan Chen, Kristen Grauman
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our proposed LeMARA model outperforms existing approaches [3, 7, 14] on challenging in-the-wild audio and environments from multiple datasets. Further, to benchmark this task, we introduce a high audio-visual correspondence subset of the AVSpeech [10] video dataset. |
| Researcher Affiliation | Collaboration | Arjun Somayazulu¹ Changan Chen¹ Kristen Grauman¹,² (¹UT Austin, ²FAIR, Meta) |
| Pseudocode | Yes | 7 Supplementary (...) 10. Pseudocode for a discriminator training epoch detailing our reverberator update mechanism (Algorithm 1) (a generic, hedged sketch of such a loop appears after the table) |
| Open Source Code | No | Project page: https://vision.cs.utexas.edu/projects/ss_vam and We plan to release our code to facilitate further research. |
| Open Datasets | Yes | We use two datasets: SoundSpaces-Speech [7] and AVSpeech [10]. |
| Dataset Splits | Yes | SoundSpaces-Speech consists of anechoic speech samples from LibriSpeech paired with their acoustically-correct reverberated waveform (rendered using SoundSpaces) in any of 82 unique environments, together with an RGBD image at the listener's position. (...) We use train/val/test splits of 28,853/280/1,489 samples. (...) Our final set consists of 72,615/1,911/1,911 train/val/test samples. |
| Hardware Specification | Yes | Compute. All models are trained on 8 NVIDIA Quadro RTX 6000 GPUs. |
| Software Dependencies | No | The paper mentions software components like "speechbrain MetricGAN-U implementation [34]" and "WaveNet-like architecture" but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We train LeMARA using the combined acoustic residue metric with α = 0.7. (...) We train the reverberators with batch size 4 and a learning rate of 1e-2 in stage (2). During stage (3) fine-tuning, we use batch size 2 and a learning rate of 1e-6. (...) In stage (1) (...) we train with batch size 32. During stage (3) fine-tuning, we use a batch size of 2. G and D are trained with learning rates of 2e-6 and 5e-4 respectively in both stages. (...) We clip each audio sample to 2.56s during training and evaluation. (These values are collected in the configuration sketch below the table.) |
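
For quick reference, the split sizes and training hyperparameters quoted in the Dataset Splits and Experiment Setup rows can be gathered into one place. The sketch below is only an illustrative summary: the `SplitSizes` and `TrainingConfig` dataclasses and their field names are our own labels, not identifiers from the paper or its (unreleased) code; only the numeric values are taken from the quoted text.

```python
from dataclasses import dataclass

# Illustrative configuration collecting the numbers quoted in the table above.
# Class and field names are invented for this summary; only the values come
# from the paper text.

@dataclass
class SplitSizes:
    train: int
    val: int
    test: int

SOUNDSPACES_SPEECH_SPLITS = SplitSizes(train=28_853, val=280, test=1_489)
AVSPEECH_SPLITS = SplitSizes(train=72_615, val=1_911, test=1_911)

@dataclass
class TrainingConfig:
    alpha_residue: float = 0.7        # weight for the combined acoustic residue metric
    audio_clip_seconds: float = 2.56  # clip length used during training and evaluation

    stage1_batch_size: int = 32       # stage (1)

    stage2_batch_size: int = 4        # stage (2): reverberator training
    stage2_reverberator_lr: float = 1e-2

    stage3_batch_size: int = 2        # stage (3): fine-tuning
    stage3_reverberator_lr: float = 1e-6

    generator_lr: float = 2e-6        # G, "in both stages" per the quoted text
    discriminator_lr: float = 5e-4    # D, "in both stages" per the quoted text
```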
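
The Pseudocode row points to Algorithm 1 in the paper's supplementary material, which is not reproduced in the quoted text, so the exact reverberator update rule is not available here. The sketch below shows only a generic GAN-style discriminator training epoch with a periodic reverberator refresh, to illustrate the overall shape such a loop could take; `TinyDiscriminator`, `TinyReverberator`, `update_reverberators`, and `update_every` are all placeholder assumptions and do not correspond to the paper's actual architectures (a WaveNet-like generator and a speechbrain MetricGAN-U-based discriminator) or its Algorithm 1.

```python
import torch
import torch.nn as nn

# Purely illustrative stand-ins; the paper's real modules are not reproduced here.
class TinyDiscriminator(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x)

class TinyReverberator(nn.Module):
    """Stand-in for a learned reverberator (here just a single linear layer)."""
    def __init__(self, dim=128):
        super().__init__()
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):
        return self.fc(x)

def update_reverberators(reverberators, disc, dry_batch, lr=1e-2):
    # Hypothetical refresh step: nudge each reverberator toward fooling the
    # discriminator. The paper's actual mechanism is given in its Algorithm 1.
    bce = nn.BCEWithLogitsLoss()
    for rev in reverberators:
        opt = torch.optim.SGD(rev.parameters(), lr=lr)
        opt.zero_grad()
        fake = rev(dry_batch)
        loss = bce(disc(fake), torch.ones(fake.size(0), 1))
        loss.backward()
        opt.step()

def discriminator_epoch(disc, reverberators, loader, opt_d, update_every=10):
    # One pass over (dry, reverberant) feature pairs, updating only the
    # discriminator, with a periodic reverberator refresh.
    bce = nn.BCEWithLogitsLoss()
    for step, (dry, reverberant) in enumerate(loader):
        # "Fake" reverberant features from one of the current reverberators.
        fake = reverberators[step % len(reverberators)](dry).detach()

        opt_d.zero_grad()
        loss = (bce(disc(reverberant), torch.ones(reverberant.size(0), 1))
                + bce(disc(fake), torch.zeros(fake.size(0), 1)))
        loss.backward()
        opt_d.step()

        if (step + 1) % update_every == 0:
            update_reverberators(reverberators, disc, dry)

# Toy usage with random features standing in for audio.
if __name__ == "__main__":
    disc = TinyDiscriminator()
    revs = [TinyReverberator() for _ in range(3)]
    loader = [(torch.randn(4, 128), torch.randn(4, 128)) for _ in range(20)]
    opt_d = torch.optim.Adam(disc.parameters(), lr=5e-4)
    discriminator_epoch(disc, revs, loader, opt_d)
```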