Weakly-supervised Audio Separation via Bi-modal Semantic Similarity
Authors: Tanvir Mahmud, Saeed Amizadeh, Kazuhito Koishida, Diana Marculescu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show that this is well within reach if we have access to a pretrained joint embedding model between the two modalities (i.e., CLAP). Furthermore, we propose to incorporate our framework into two fundamental scenarios to enhance separation performance. First, we show that our proposed methodology significantly improves the performance... Second, we show that we can further improve the performance... |
| Researcher Affiliation | Collaboration | Tanvir Mahmud¹, Saeed Amizadeh², Kazuhito Koishida² & Diana Marculescu¹ (¹The University of Texas at Austin, USA; ²Microsoft Corporation). {tanvirmahmud, dianam}@utexas.edu, {saamizad, kazukoi}@microsoft.com |
| Pseudocode | Yes | The complete training algorithm of the proposed framework is illustrated in Algorithm 1. |
| Open Source Code | Yes | Code is released at https://github.com/microsoft/BiModalAudioSeparation/. |
| Open Datasets | Yes | Datasets: We experiment on synthetic mixtures produced from single-source MUSIC (Zhao et al., 2018) and VGGSound (Chen et al., 2020) datasets by mixing samples from n sources (see the mixture sketch after this table). We use the same test set containing samples of 2 sources for each dataset in all experiments. We also experiment with AudioCaps (Kim et al., 2019), a natural mixture dataset containing 1-6 sounding sources in each mixture with full-length captions... |
| Dataset Splits | No | For MUSIC, we use 80% of the videos of each class for training and the remaining for testing (see the split sketch after this table). For VGGSound, we use the official train and test split that contains 162,199 training videos and 13,398 test videos. For AudioCaps, we use the official train and test splits that contain 45,182 and 4,110 mixtures, respectively. While it is stated 'We validate the model after every training epoch,' the specific split size or methodology for the validation set is not provided. |
| Hardware Specification | Yes | All the training was carried out with 8 RTX-A6000 GPUs with 48GB memory. |
| Software Dependencies | No | The paper mentions using the 'PyTorch library (Paszke et al., 2019)' and the 'Torchaudio package (Yang et al., 2022)' but does not provide specific version numbers for these software dependencies (e.g., 'PyTorch 1.9'). |
| Experiment Setup | Yes | All the models are trained for 50 epochs with an initial learning rate of 0.001. The learning rate drops by a factor of 0.1 after every 15 epochs. The Adam optimizer (Kingma & Ba, 2014) is used with β1 = 0.9, β2 = 0.999 and ϵ = 10^-8 for backpropagation. All the training was carried out with 8 RTX-A6000 GPUs with 48GB memory. We validate the model after every training epoch. We use a batch size of 32 for the MUSIC dataset, and a batch size of 64 for the VGGSound and AudioCaps datasets. A training-configuration sketch based on these values appears after this table. |
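
The synthetic-mixture setup quoted in the Open Datasets row (mixing samples from n single-source clips) can be illustrated with a short sketch. This is a minimal illustration, not the authors' data pipeline: the file names, sample rate, and clip duration are assumptions, and torchaudio is used only because the paper cites the Torchaudio package.

```python
import torch
import torchaudio

def make_mixture(paths, target_sr=16000, duration_s=5.0):
    """Mix n single-source clips into one synthetic mixture (illustrative only)."""
    num_samples = int(target_sr * duration_s)
    sources = []
    for p in paths:
        wav, sr = torchaudio.load(p)                # (channels, time)
        wav = wav.mean(dim=0, keepdim=True)         # down-mix to mono
        if sr != target_sr:
            wav = torchaudio.functional.resample(wav, sr, target_sr)
        wav = torch.nn.functional.pad(wav, (0, max(0, num_samples - wav.shape[-1])))
        sources.append(wav[:, :num_samples])
    mixture = torch.stack(sources, dim=0).sum(dim=0)  # linear sum of the n sources
    return mixture, sources

# Example: a 2-source mixture, matching the 2-source test condition.
# mix, srcs = make_mixture(["violin_0001.wav", "drums_0042.wav"])
```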
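
The MUSIC split is described only as 80% of the videos of each class for training and the rest for testing, so a per-class split could be reproduced along the lines below. The helper name, the class-to-video input format, and the fixed seed are assumptions, not details from the paper.

```python
import random

def split_music_per_class(video_ids_by_class, train_frac=0.8, seed=0):
    """Per-class 80/20 train/test split, as described for the MUSIC dataset."""
    rng = random.Random(seed)   # fixed seed chosen here for repeatability (assumption)
    train, test = [], []
    for cls, vids in video_ids_by_class.items():
        vids = sorted(vids)
        rng.shuffle(vids)
        cut = int(round(train_frac * len(vids)))
        train += [(cls, v) for v in vids[:cut]]
        test += [(cls, v) for v in vids[cut:]]
    return train, test

# Example with a toy class-to-video mapping:
# train_ids, test_ids = split_music_per_class({"violin": ["a", "b", "c", "d", "e"]})
```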
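
The Experiment Setup row maps directly onto standard PyTorch components, so a minimal training-configuration sketch based on the stated hyperparameters (Adam with β1 = 0.9, β2 = 0.999, ϵ = 10^-8, initial learning rate 0.001, decayed by 0.1 every 15 epochs, 50 epochs, per-epoch validation) is given below. The model, the data loaders, and the compute_loss method are placeholders rather than the authors' implementation.

```python
import torch

def train(model, train_loader, val_loader, epochs=50, device="cuda"):
    """Training loop mirroring the reported schedule; loss and model are placeholders."""
    model.to(device)
    optimizer = torch.optim.Adam(
        model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8
    )
    # Drop the learning rate by a factor of 0.1 after every 15 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)

    for epoch in range(epochs):
        model.train()
        for batch in train_loader:            # batch size 32 (MUSIC) or 64 (VGGSound/AudioCaps)
            optimizer.zero_grad()
            loss = model.compute_loss(batch)  # placeholder for the paper's weakly-supervised loss
            loss.backward()
            optimizer.step()
        scheduler.step()

        model.eval()                          # validate after every training epoch
        with torch.no_grad():
            for batch in val_loader:
                _ = model.compute_loss(batch)
```

Batch size lives in the data loaders here (32 for MUSIC, 64 for VGGSound and AudioCaps), and multi-GPU distribution over the 8 RTX-A6000 cards is omitted for brevity.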