A Closer Look at Weakly-Supervised Audio-Visual Source Localization
Authors: Shentong Mo, Pedro Morgado
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using the new protocol, we conducted an extensive evaluation of prior methods, and found that most prior works are not capable of identifying negatives and suffer from significant overfitting problems (relying heavily on early stopping for best results). We also propose a new approach for visual sound source localization that addresses both these problems. |
| Researcher Affiliation | Academia | Shentong Mo (Carnegie Mellon University); Pedro Morgado (University of Wisconsin-Madison) |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as "Pseudocode" or "Algorithm". |
| Open Source Code | Yes | Code and pre-trained models are available at https://github.com/stoneMo/SLAVC. |
| Open Datasets | Yes | We evaluate the effectiveness of the proposed method on two datasets: Flickr-SoundNet [1] and VGG-Sound Sources [45]. |
| Dataset Splits | No | The paper mentions using "a subset of 144k samples for training" and "extended test sets", and discusses validating the model for early stopping, but does not provide explicit details on the size or composition of a separate validation split. |
| Hardware Specification | No | Models are trained with a batch size of 128 on 2 GPUs for 20 epochs (which we found to be enough to achieve convergence in most cases). |
| Software Dependencies | No | Our implementation, available at https://github.com/stoneMo/SLAVC, is based on the PyTorch [49] deep learning toolkit. No specific library versions are given. |
| Experiment Setup | Yes | The visual encoder is initialized with ImageNet [47] pre-trained weights [6, 9, 5]. The output dimensions of the audio and visual encoders (i.e., the output of projection functions g(·)) was kept at 512, the momentum encoders update factor at 0.999, and the visual dropout at 0.9. No audio dropout is applied. Models are trained with a batch size of 128 on 2 GPUs for 20 epochs... We used the Adam [48] optimizer with β1 = 0.9, β2 = 0.999, learning rate of 1e-4 and weight decay of 1e-4. |
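
The Experiment Setup row above pins down most of the training hyperparameters. As a reading aid, here is a minimal PyTorch sketch of that configuration. The encoder modules and their 512-dimensional input feature size are placeholders (the paper's actual backbones and the SLAVC localization loss are not reproduced here); the embedding dimension, momentum-encoder update factor, visual dropout, and Adam settings are taken directly from the quoted text.

```python
# Minimal sketch of the quoted training configuration, assuming
# hypothetical stand-in encoders. Hyperparameters (512-d outputs,
# EMA momentum 0.999, visual dropout 0.9, Adam with lr/wd 1e-4)
# come from the paper's Experiment Setup description.
import copy
import torch
import torch.nn as nn

EMBED_DIM = 512       # output dim of the projection g(.) for both streams
EMA_MOMENTUM = 0.999  # momentum-encoder update factor
VISUAL_DROPOUT = 0.9  # dropout on visual features; no audio dropout

# Stand-in encoders (hypothetical 512-d input features); the paper
# initializes its visual backbone with ImageNet pre-trained weights,
# which is not reproduced here.
visual_encoder = nn.Sequential(nn.Dropout(VISUAL_DROPOUT), nn.Linear(512, EMBED_DIM))
audio_encoder = nn.Linear(512, EMBED_DIM)

# Momentum (EMA) copies of each encoder, updated without gradients.
visual_ema = copy.deepcopy(visual_encoder)
audio_ema = copy.deepcopy(audio_encoder)
for ema in (visual_ema, audio_ema):
    for p in ema.parameters():
        p.requires_grad_(False)

@torch.no_grad()
def ema_update(online: nn.Module, target: nn.Module, m: float = EMA_MOMENTUM) -> None:
    """Parameter-wise EMA update: target <- m * target + (1 - m) * online."""
    for p_t, p_o in zip(target.parameters(), online.parameters()):
        p_t.mul_(m).add_(p_o, alpha=1.0 - m)

# Optimizer exactly as quoted: Adam, betas (0.9, 0.999), lr 1e-4, wd 1e-4.
params = list(visual_encoder.parameters()) + list(audio_encoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4, betas=(0.9, 0.999), weight_decay=1e-4)

# Training schedule per the paper: batch size 128 on 2 GPUs for 20 epochs;
# ema_update(...) would be called on both streams after each optimizer step.
```

Note that the hardware detail ("2 GPUs") and the absence of pinned software versions are exactly why the Hardware Specification and Software Dependencies rows above are marked "No": the setup is reproducible in outline but not fully specified.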