Weakly-supervised Audio Separation via Bi-modal Semantic Similarity
Authors: Tanvir Mahmud, Saeed Amizadeh, Kazuhito Koishida, Diana Marculescu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show that this is well within reach if we have access to a pretrained joint embedding model between the two modalities (i.e., CLAP). Furthermore, we propose to incorporate our framework into two fundamental scenarios to enhance separation performance. First, we show that our proposed methodology significantly improves the performance... Second, we show that we can further improve the performance... |
| Researcher Affiliation | Collaboration | Tanvir Mahmud¹, Saeed Amizadeh², Kazuhito Koishida² & Diana Marculescu¹ (¹The University of Texas at Austin, USA; ²Microsoft Corporation). {tanvirmahmud, dianam}@utexas.edu, {saamizad, kazukoi}@microsoft.com |
| Pseudocode | Yes | The complete training algorithm of the proposed framework is illustrated in Algorithm 1. |
| Open Source Code | Yes | Code is released at https://github.com/microsoft/BiModalAudioSeparation/. |
| Open Datasets | Yes | Datasets: We experiment on synthetic mixtures produced from single-source MUSIC (Zhao et al., 2018) and VGGSound (Chen et al., 2020) datasets by mixing samples from n sources (see the mixture sketch after this table). We use the same test set containing samples of 2 sources for each dataset in all experiments. We also experiment with AudioCaps (Kim et al., 2019), a natural mixture dataset containing 1-6 sounding sources in each mixture with full-length captions... |
| Dataset Splits | No | For MUSIC, we use 80% of the videos of each class for training and the remaining for testing (see the split sketch after this table). For VGGSound, we use the official train and test split that contains 162,199 training videos and 13,398 test videos. For AudioCaps, we use the official train and test splits that contain 45,182 and 4,110 mixtures, respectively. While it is stated 'We validate the model after every training epoch,' the specific split size or methodology for the validation set is not provided. |
| Hardware Specification | Yes | All the training was carried out with 8 RTX-A6000 GPUs with 48GB memory. |
| Software Dependencies | No | The paper mentions using the 'PyTorch library (Paszke et al., 2019)' and the 'Torchaudio package (Yang et al., 2022)' but does not provide specific version numbers for these software dependencies (e.g., 'PyTorch 1.9'). |
| Experiment Setup | Yes | All the models are trained for 50 epochs with an initial learning rate of 0.001. The learning rate drops by a factor of 0.1 after every 15 epochs. The Adam optimizer (Kingma & Ba, 2014) is used with β1 = 0.9, β2 = 0.999 and ϵ = 10^-8 for backpropagation. All the training was carried out with 8 RTX-A6000 GPUs with 48GB memory. We validate the model after every training epoch. We use a batch size of 32 for the MUSIC dataset, and a batch size of 64 for the VGGSound and AudioCaps datasets. A training-configuration sketch based on these values appears after this table. |
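
The synthetic-mixture setup quoted in the Open Datasets row (mixing samples from n single-source clips) can be illustrated with a short sketch. This is a minimal illustration, not the authors' data pipeline: the file names, sample rate, and clip duration are assumptions, and torchaudio is used only because the paper cites the Torchaudio package.

```python
import torch
import torchaudio

def make_mixture(paths, target_sr=16000, duration_s=5.0):
    """Mix n single-source clips into one synthetic mixture (illustrative only)."""
    num_samples = int(target_sr * duration_s)
    sources = []
    for p in paths:
        wav, sr = torchaudio.load(p)                # (channels, time)
        wav = wav.mean(dim=0, keepdim=True)         # down-mix to mono
        if sr != target_sr:
            wav = torchaudio.functional.resample(wav, sr, target_sr)
        wav = torch.nn.functional.pad(wav, (0, max(0, num_samples - wav.shape[-1])))
        sources.append(wav[:, :num_samples])
    mixture = torch.stack(sources, dim=0).sum(dim=0)  # linear sum of the n sources
    return mixture, sources

# Example: a 2-source mixture, matching the 2-source test condition.
# mix, srcs = make_mixture(["violin_0001.wav", "drums_0042.wav"])
```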
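
The MUSIC split is described only as 80% of the videos of each class for training and the rest for testing, so a per-class split could be reproduced along the lines below. The helper name, the class-to-video input format, and the fixed seed are assumptions, not details from the paper.

```python
import random

def split_music_per_class(video_ids_by_class, train_frac=0.8, seed=0):
    """Per-class 80/20 train/test split, as described for the MUSIC dataset."""
    rng = random.Random(seed)   # fixed seed chosen here for repeatability (assumption)
    train, test = [], []
    for cls, vids in video_ids_by_class.items():
        vids = sorted(vids)
        rng.shuffle(vids)
        cut = int(round(train_frac * len(vids)))
        train += [(cls, v) for v in vids[:cut]]
        test += [(cls, v) for v in vids[cut:]]
    return train, test

# Example with a toy class-to-video mapping:
# train_ids, test_ids = split_music_per_class({"violin": ["a", "b", "c", "d", "e"]})
```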
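
The Experiment Setup row maps directly onto standard PyTorch components, so a minimal training-configuration sketch based on the stated hyperparameters (Adam with β1 = 0.9, β2 = 0.999, ϵ = 10^-8, initial learning rate 0.001, decayed by 0.1 every 15 epochs, 50 epochs, per-epoch validation) is given below. The model, the data loaders, and the compute_loss method are placeholders rather than the authors' implementation.

```python
import torch

def train(model, train_loader, val_loader, epochs=50, device="cuda"):
    """Training loop mirroring the reported schedule; loss and model are placeholders."""
    model.to(device)
    optimizer = torch.optim.Adam(
        model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8
    )
    # Drop the learning rate by a factor of 0.1 after every 15 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)

    for epoch in range(epochs):
        model.train()
        for batch in train_loader:            # batch size 32 (MUSIC) or 64 (VGGSound/AudioCaps)
            optimizer.zero_grad()
            loss = model.compute_loss(batch)  # placeholder for the paper's weakly-supervised loss
            loss.backward()
            optimizer.step()
        scheduler.step()

        model.eval()                          # validate after every training epoch
        with torch.no_grad():
            for batch in val_loader:
                _ = model.compute_loss(batch)
```

Batch size lives in the data loaders here (32 for MUSIC, 64 for VGGSound and AudioCaps), and multi-GPU distribution over the 8 RTX-A6000 cards is omitted for brevity.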