A Unified Audio-Visual Learning Framework for Localization, Separation, and Recognition
Authors: Shentong Mo, Pedro Morgado
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on MUSIC, VGG-Instruments, VGG-Music, and VGG-Sound datasets demonstrate the effectiveness of OneAVM for all three tasks, audio-visual source localization, separation, and nearest neighbor recognition, and empirically demonstrate a strong positive transfer between them. |
| Researcher Affiliation | Academia | Carnegie Mellon University; University of Wisconsin-Madison, Department of Electrical and Computer Engineering |
| Pseudocode | No | The paper provides an illustration of the framework (Figure 2) but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/stoneMo/OneAVM |
| Open Datasets | Yes | We conducted experiments on the following audio-visual datasets. 1) MUSIC (Zhao et al., 2018) ... 2) VGGSound-Instruments (Hu et al., 2022) ... 3) We composed another more challenging musical subset from VGG-Sound (Chen et al., 2020b) ... 4) Beyond the musical datasets, we used 150k video clips from 221 categories in VGG-Sound (Chen et al., 2020b) ... 5) We also used the Kinetics-400 dataset (Carreira & Zisserman, 2017) |
| Dataset Splits | Yes | MUSIC: We use 358 solo videos for training and 90 solo videos for evaluation. VGGSound-Instruments: 32k video clips... for training and 446 videos for testing. VGGSound-Music: 40,908 video clips... for training and 1201 clips for testing. VGGSound-All: For testing, we used the full VGG-Sound Source test set, which contains 5158 videos with source localization annotations. |
| Hardware Specification | No | The paper does not specify the hardware used for experiments, such as GPU models, CPU types, or memory. |
| Software Dependencies | No | The paper mentions using the Adam optimizer but does not specify version numbers for any software dependencies like programming languages, deep learning frameworks (e.g., PyTorch, TensorFlow), or other libraries. |
| Experiment Setup | Yes | The models were trained for 20 epochs using the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 1e-4 and a batch size of 128. Unless otherwise specified, the decoder depth for mixed audio separation was set to 8, and the mixing coefficient for mixed visual alignment was set to α = 0.5. |
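
The reported setup (20 epochs, Adam, learning rate 1e-4, batch size 128) can be made concrete with a minimal training-loop sketch. The paper does not name its deep learning framework, so PyTorch is assumed here; `StandInAVM` is a trivial placeholder, not the OneAVM architecture, and the decoder depth of 8 and mixing coefficient α = 0.5 are architecture-specific settings that this sketch does not represent.

```python
# Hedged sketch of the reported optimizer/loop settings only.
# StandInAVM and the random tensors are stand-ins, not the paper's model or data.
import torch
import torch.nn as nn

class StandInAVM(nn.Module):
    """Placeholder module standing in for OneAVM (not the paper's architecture)."""
    def __init__(self):
        super().__init__()
        self.audio_proj = nn.Linear(128, 64)
        self.visual_proj = nn.Linear(512, 64)

    def forward(self, audio, visual):
        # Dummy alignment-style loss between audio and visual embeddings.
        a = self.audio_proj(audio)
        v = self.visual_proj(visual)
        return (a - v).pow(2).mean()

model = StandInAVM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # Adam, lr = 1e-4 per the paper

for epoch in range(20):                    # 20 training epochs per the paper
    audio = torch.randn(128, 128)          # batch size 128 (random stand-in features)
    visual = torch.randn(128, 512)
    loss = model(audio, visual)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```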