CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos
Authors: Hao-Wen Dong, Naoya Takahashi, Yuki Mitsufuji, Julian McAuley, Taylor Berg-Kirkpatrick
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that the proposed models successfully learn text-queried universal sound separation using only noisy unlabeled videos, even achieving competitive performance against a supervised model in some settings. |
| Researcher Affiliation | Collaboration | ¹Sony Group Corporation, ²University of California San Diego |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | For reproducibility, all source code, hyperparameters and pretrained models are available at: https://github.com/sony/CLIPSep. |
| Open Datasets | Yes | We use the VGGSound dataset (Chen et al., 2020)... We also evaluate the proposed CLIPSep model without the noise invariant training on musical instrument sound separation task using the MUSIC dataset, as done in (Zhao et al., 2018). |
| Dataset Splits | No | We validate the model every 10,000 steps using image queries, as we do not assume labeled data is available for the validation set. The paper states that a validation set is used but does not provide specific details on its size, percentage, or how it is split from the main dataset. |
| Hardware Specification | No | The paper does not specify any particular hardware (e.g., CPU, GPU models, memory) used for the experiments. |
| Software Dependencies | No | We implement all the models using the PyTorch library (Paszke et al., 2019). We compute the signal-to-distortion ratio (SDR) using museval (Stöter et al., 2018). While these software libraries are mentioned, specific version numbers for the dependencies are not provided in the text. (A hedged museval usage sketch follows the table.) |
| Experiment Setup | Yes | We implement the audio model as a 7-layer U-Net (Ronneberger et al., 2015). We use k = 32. We use binary masks as the ground truth masks during training while using the raw, real-valued masks for evaluation. We train all the models for 200,000 steps with a batch size of 32. We use the Adam optimizer (Kingma & Ba, 2015) with β1 = 0.9, β2 = 0.999 and ϵ = 10^-8. In addition, we clip the norm of the gradients to 1.0. We adopt the following learning rate schedule with a warm-up: the learning rate starts from 0 and grows to 0.001 after 5,000 steps, and then it linearly drops to 0.0001 at 100,000 steps and keeps this value thereafter. We use a sampling rate of 16,000 Hz and work on audio clips of length 65,535 samples (≈ 4 seconds). For the spectrogram computation, we use a filter length of 1024, a hop length of 256 and a window size of 1024 in the short-time Fourier transform (STFT). We resize images extracted from the videos to 224-by-224 pixels. (Hedged sketches of this configuration follow the table.) |
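
The Experiment Setup row reports the optimizer, gradient clipping, and warm-up schedule in enough detail to reconstruct the training loop. Below is a minimal PyTorch sketch under those reported values; the model, loss, and data are placeholders of our own, not the paper's code, and the schedule function name is hypothetical.

```python
import torch

def lr_at_step(step: int) -> float:
    """Warm-up from 0 to 1e-3 over 5k steps, linear decay to 1e-4 at 100k steps, constant after."""
    if step < 5_000:
        return 0.001 * step / 5_000
    if step < 100_000:
        frac = (step - 5_000) / (100_000 - 5_000)
        return 0.001 + frac * (0.0001 - 0.001)
    return 0.0001

model = torch.nn.Linear(8, 8)  # stand-in for the actual CLIPSep model
optimizer = torch.optim.Adam(model.parameters(), lr=0.001,
                             betas=(0.9, 0.999), eps=1e-8)
# LambdaLR scales the base lr (0.001), so divide the target lr by it.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: lr_at_step(step) / 0.001)

for step in range(200_000):                    # 200,000 training steps
    batch = torch.randn(32, 8)                 # batch size 32 (dummy data)
    loss = model(batch).pow(2).mean()          # dummy loss in place of the separation loss
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip gradient norm to 1.0
    optimizer.step()
    scheduler.step()
```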
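
The same row also fixes the audio front end: 16 kHz audio, clips of 65,535 samples (about 4 seconds), and an STFT with filter length 1024, hop length 256, and window size 1024. The sketch below uses `torch.stft` with those values; the Hann window and the binary-mask rule (mark the bins where the target source dominates the rest of the mixture) are common choices we assume, not details confirmed by the excerpt above.

```python
import torch

SAMPLE_RATE = 16_000        # Hz
CLIP_LENGTH = 65_535        # samples, roughly 4 seconds; yields 256 STFT frames at hop 256
N_FFT = 1024                # filter length
HOP_LENGTH = 256
WIN_LENGTH = 1024

audio = torch.randn(CLIP_LENGTH)                 # stand-in for one mono clip
window = torch.hann_window(WIN_LENGTH)           # assumed window function
spec = torch.stft(audio, n_fft=N_FFT, hop_length=HOP_LENGTH,
                  win_length=WIN_LENGTH, window=window, return_complex=True)
mixture_mag = spec.abs()                         # (513, 256) magnitude spectrogram

# Assumed binary ground-truth mask: 1 where the target source's magnitude
# exceeds that of the residual (mixture minus target); the paper's exact rule may differ.
target_mag = torch.rand_like(mixture_mag) * mixture_mag   # stand-in source magnitude
binary_mask = (target_mag > (mixture_mag - target_mag)).float()
```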
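
The Software Dependencies row notes that SDR is computed with museval. A minimal usage sketch follows; the number of sources, array shapes, and the one-second evaluation window at 16 kHz are our assumptions, not settings taken from the paper.

```python
import numpy as np
import museval

sr = 16_000
n_samples = 4 * sr
references = np.random.randn(2, n_samples, 1)    # (n_sources, n_samples, n_channels)
estimates = references + 0.01 * np.random.randn(2, n_samples, 1)

# museval.evaluate returns framewise SDR, ISR, SIR, and SAR per source.
sdr, isr, sir, sar = museval.evaluate(references, estimates, win=sr, hop=sr)
print(sdr.mean(axis=1))                          # mean SDR for each source
```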