A Closer Look at Weakly-Supervised Audio-Visual Source Localization

Authors: Shentong Mo, Pedro Morgado

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using the new protocol, we conducted an extensive evaluation of prior methods, and found that most prior works are not capable of identifying negatives and suffer from significant overfitting problems (rely heavily on early stopping for best results). We also propose a new approach for visual sound source localization that addresses both these problems.
Researcher Affiliation | Academia | Shentong Mo (Carnegie Mellon University), Pedro Morgado (University of Wisconsin-Madison)
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as "Pseudocode" or "Algorithm".
Open Source Code | Yes | Code and pre-trained models are available at https://github.com/stoneMo/SLAVC.
Open Datasets | Yes | We evaluate the effectiveness of the proposed method on two datasets, Flickr SoundNet [1] and VGG-Sound Sources [45].
Dataset Splits | No | The paper mentions using "a subset of 144k samples for training" and "extended test sets", and discusses validating the model for early stopping, but does not provide explicit details on the size or composition of a separate validation split.
Hardware Specification | No | Models are trained with a batch size of 128 on 2 GPUs for 20 epochs (which we found to be enough to achieve convergence in most cases).
Software Dependencies | No | Our implementation, available at https://github.com/stoneMo/SLAVC, is based on the PyTorch [49] deep learning tool.
Experiment Setup | Yes | The visual encoder is initialized with ImageNet [47] pre-trained weights [6, 9, 5]. The output dimensions of the audio and visual encoders (i.e., the outputs of the projection functions g(·)) were kept at 512, the momentum encoders' update factor at 0.999, and the visual dropout at 0.9. No audio dropout is applied. Models are trained with a batch size of 128 on 2 GPUs for 20 epochs... We used the Adam [48] optimizer with β1 = 0.9, β2 = 0.999, learning rate of 1e-4 and weight decay of 1e-4.
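The Experiment Setup row reports enough hyperparameters to sketch the training configuration. The snippet below is a minimal, hypothetical PyTorch reconstruction, not the authors' SLAVC code: the encoder class, ResNet-18 backbone, and helper names are illustrative assumptions, while only the quoted values (512-d projection outputs g(·), momentum-encoder factor 0.999, visual dropout 0.9, Adam with β1 = 0.9, β2 = 0.999, learning rate 1e-4, weight decay 1e-4, batch size 128, 20 epochs) come from the paper.

```python
# Minimal sketch of the reported training configuration.
# Placeholder encoder and backbone; hyperparameter values are from the paper excerpt above.
import copy
import torch
import torch.nn as nn
import torchvision.models as models

EMBED_DIM = 512        # output dimension of the projection functions g(·)
MOMENTUM = 0.999       # momentum-encoder EMA update factor
VISUAL_DROPOUT = 0.9   # dropout applied on the visual side only (no audio dropout)

class VisualEncoder(nn.Module):
    """Hypothetical visual branch: ImageNet-pretrained ResNet-18 + 512-d projection."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier head
        self.dropout = nn.Dropout(VISUAL_DROPOUT)
        self.proj = nn.Linear(512, EMBED_DIM)  # plays the role of g(·) for the visual stream

    def forward(self, images):
        feats = self.backbone(images).flatten(1)
        return self.proj(self.dropout(feats))

encoder = VisualEncoder()
momentum_encoder = copy.deepcopy(encoder)  # EMA copy, never updated by gradients
for p in momentum_encoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(
    encoder.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=1e-4)

@torch.no_grad()
def update_momentum_encoder():
    """EMA update: theta_m <- m * theta_m + (1 - m) * theta."""
    for p_m, p in zip(momentum_encoder.parameters(), encoder.parameters()):
        p_m.mul_(MOMENTUM).add_(p, alpha=1.0 - MOMENTUM)

# Training loop skeleton (loss and data loading omitted): batch size 128, 20 epochs,
# with update_momentum_encoder() called after each optimizer.step().
```

In the paper's actual setup, an audio encoder is configured analogously (same 512-d projection, but without dropout), and both streams maintain momentum copies updated with the same 0.999 factor.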