A Unified Audio-Visual Learning Framework for Localization, Separation, and Recognition

Authors: Shentong Mo, Pedro Morgado

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on MUSIC, VGG-Instruments, VGG-Music, and VGGSound datasets demonstrate the effectiveness of OneAVM for all three tasks: audio-visual source localization, separation, and nearest-neighbor recognition, and empirically demonstrate a strong positive transfer between them.
Researcher Affiliation | Academia | 1 Carnegie Mellon University; 2 University of Wisconsin-Madison, Department of Electrical and Computer Engineering
Pseudocode | No | The paper provides an illustration of the framework (Figure 2) but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | https://github.com/stoneMo/OneAVM
Open Datasets | Yes | We conducted experiments on the following audio-visual datasets. 1) MUSIC (Zhao et al., 2018) ... 2) VGGSound-Instruments (Hu et al., 2022) ... 3) We composed another more challenging musical subset from VGG-Sound (Chen et al., 2020b) ... 4) Beyond the musical datasets, we used 150k video clips from 221 categories in VGG-Sound (Chen et al., 2020b) ... 5) We also used the Kinetics-400 dataset (Carreira & Zisserman, 2017)
Dataset Splits | Yes | MUSIC: We use 358 solo videos for training and 90 solo videos for evaluation. VGGSound-Instruments: 32k video clips... for training and 446 videos for testing. VGGSound-Music: 40,908 video clips... for training and 1201 clips for testing. VGGSound-All: For testing, we used the full VGG-Sound Source test set, which contains 5158 videos with source localization annotations.
Hardware Specification | No | The paper does not specify the hardware used for experiments, such as GPU models, CPU types, or memory.
Software Dependencies | No | The paper mentions using the Adam optimizer but does not specify version numbers for any software dependencies like programming languages, deep learning frameworks (e.g., PyTorch, TensorFlow), or other libraries.
Experiment Setup | Yes | The models were trained for 20 epochs using the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 1e-4 and a batch size of 128. Unless otherwise specified, the decoder depth for mixed audio separation was set to 8, and the mixing coefficient for mixed visual alignment was set to α = 0.5.
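
To put the quoted hyperparameters in context, the sketch below arranges them in a minimal training loop. It assumes PyTorch purely for illustration (the paper does not name its framework), and the placeholder model, dummy data loader, loss, and the mixup-style use of α are assumptions, not the authors' OneAVM implementation.

```python
"""Illustrative sketch of the reported training configuration.

Assumption: PyTorch is used only as an example framework; the model,
data, and loss are placeholders, not the authors' OneAVM code.
"""
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

EPOCHS = 20          # "trained for 20 epochs"
LR = 1e-4            # Adam learning rate
BATCH_SIZE = 128
DECODER_DEPTH = 8    # decoder depth for mixed audio separation
ALPHA = 0.5          # mixing coefficient for mixed visual alignment

# Placeholder stack standing in for the depth-8 separation decoder.
layers = []
for _ in range(DECODER_DEPTH):
    layers += [nn.Linear(512, 512), nn.ReLU()]
model = nn.Sequential(*layers)

optimizer = torch.optim.Adam(model.parameters(), lr=LR)

# Dummy features in place of real audio-visual batches.
loader = DataLoader(torch.randn(1024, 512), batch_size=BATCH_SIZE, shuffle=True)

for epoch in range(EPOCHS):
    for feats in loader:
        # Mixup-style blend showing how a mixing coefficient alpha = 0.5
        # could combine two visual inputs (illustrative only).
        mixed = ALPHA * feats + (1.0 - ALPHA) * feats.flip(0)
        out = model(mixed)
        # Placeholder objective; OneAVM's actual losses cover localization,
        # separation, and recognition as described in the paper.
        loss = out.pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

In practice, the dummy loader would be replaced by one of the splits quoted above, e.g., the 358 MUSIC solo training videos.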