A Unified Audio-Visual Learning Framework for Localization, Separation, and Recognition

Authors: Shentong Mo, Pedro Morgado

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on MUSIC, VGG-Instruments, VGG-Music, and VGGSound datasets demonstrate the effectiveness of OneAVM for all three tasks: audio-visual source localization, separation, and nearest-neighbor recognition, and empirically demonstrate a strong positive transfer between them.
Researcher Affiliation | Academia | 1 Carnegie Mellon University; 2 University of Wisconsin-Madison, Department of Electrical and Computer Engineering
Pseudocode | No | The paper provides an illustration of the framework (Figure 2) but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | https://github.com/stoneMo/OneAVM
Open Datasets | Yes | We conducted experiments on the following audio-visual datasets. 1) MUSIC (Zhao et al., 2018) ... 2) VGGSound-Instruments (Hu et al., 2022) ... 3) We composed another more challenging musical subset from VGG-Sound (Chen et al., 2020b) ... 4) Beyond the musical datasets, we used 150k video clips from 221 categories in VGG-Sound (Chen et al., 2020b) ... 5) We also used the Kinetics-400 dataset (Carreira & Zisserman, 2017)
Dataset Splits | Yes | MUSIC: We use 358 solo videos for training and 90 solo videos for evaluation. VGGSound-Instruments: 32k video clips... for training and 446 videos for testing. VGGSound-Music: 40,908 video clips... for training and 1201 clips for testing. VGGSound-All: For testing, we used the full VGG-Sound Source test set, which contains 5158 videos with source localization annotations.
Hardware Specification | No | The paper does not specify the hardware used for experiments, such as GPU models, CPU types, or memory.
Software Dependencies | No | The paper mentions using the Adam optimizer but does not specify version numbers for any software dependencies like programming languages, deep learning frameworks (e.g., PyTorch, TensorFlow), or other libraries.
Experiment Setup | Yes | The models were trained for 20 epochs using the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 1e-4 and a batch size of 128. Unless otherwise specified, the decoder depth for mixed audio separation was set to 8, and the mixing coefficient for mixed visual alignment was set to α = 0.5.
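
To put the quoted hyperparameters in context, the sketch below arranges them in a minimal training loop. It assumes PyTorch purely for illustration (the paper does not name its framework), and the placeholder model, dummy data loader, loss, and the mixup-style use of α are assumptions, not the authors' OneAVM implementation.

```python
"""Illustrative sketch of the reported training configuration.

Assumption: PyTorch is used only as an example framework; the model,
data, and loss are placeholders, not the authors' OneAVM code.
"""
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

EPOCHS = 20          # "trained for 20 epochs"
LR = 1e-4            # Adam learning rate
BATCH_SIZE = 128
DECODER_DEPTH = 8    # decoder depth for mixed audio separation
ALPHA = 0.5          # mixing coefficient for mixed visual alignment

# Placeholder stack standing in for the depth-8 separation decoder.
layers = []
for _ in range(DECODER_DEPTH):
    layers += [nn.Linear(512, 512), nn.ReLU()]
model = nn.Sequential(*layers)

optimizer = torch.optim.Adam(model.parameters(), lr=LR)

# Dummy features in place of real audio-visual batches.
loader = DataLoader(torch.randn(1024, 512), batch_size=BATCH_SIZE, shuffle=True)

for epoch in range(EPOCHS):
    for feats in loader:
        # Mixup-style blend showing how a mixing coefficient alpha = 0.5
        # could combine two visual inputs (illustrative only).
        mixed = ALPHA * feats + (1.0 - ALPHA) * feats.flip(0)
        out = model(mixed)
        # Placeholder objective; OneAVM's actual losses cover localization,
        # separation, and recognition as described in the paper.
        loss = out.pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

In practice, the dummy loader would be replaced by one of the splits quoted above, e.g., the 358 MUSIC solo training videos.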