CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos
Authors: Hao-Wen Dong, Naoya Takahashi, Yuki Mitsufuji, Julian McAuley, Taylor Berg-Kirkpatrick
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that the proposed models successfully learn text-queried universal sound separation using only noisy unlabeled videos, even achieving competitive performance against a supervised model in some settings. |
| Researcher Affiliation | Collaboration | ¹Sony Group Corporation, ²University of California San Diego |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | For reproducibility, all source code, hyperparameters and pretrained models are available at: https://github.com/sony/CLIPSep. |
| Open Datasets | Yes | We use the VGGSound dataset (Chen et al., 2020)... We also evaluate the proposed CLIPSep model without the noise invariant training on musical instrument sound separation task using the MUSIC dataset, as done in (Zhao et al., 2018). |
| Dataset Splits | No | We validate the model every 10,000 steps using image queries, as we do not assume labeled data is available for the validation set. The paper states that a validation set is used but does not provide specific details on its size, percentage, or how it is split from the main dataset. |
| Hardware Specification | No | The paper does not specify any particular hardware (e.g., CPU, GPU models, memory) used for the experiments. |
| Software Dependencies | No | We implement all the models using the PyTorch library (Paszke et al., 2019). We compute the signal-to-distortion ratio (SDR) using museval (Stöter et al., 2018). While these software libraries are mentioned, specific version numbers for the dependencies are not provided in the text. (A hedged museval usage sketch follows the table.) |
| Experiment Setup | Yes | We implement the audio model as a 7-layer U-Net (Ronneberger et al., 2015). We use k = 32. We use binary masks as the ground truth masks during training while using the raw, real-valued masks for evaluation. We train all the models for 200,000 steps with a batch size of 32. We use the Adam optimizer (Kingma & Ba, 2015) with β1 = 0.9, β2 = 0.999 and ϵ = 10^-8. In addition, we clip the norm of the gradients to 1.0. We adopt the following learning rate schedule with a warm-up: the learning rate starts from 0 and grows to 0.001 after 5,000 steps, and then it linearly drops to 0.0001 at 100,000 steps and keeps this value thereafter. We use a sampling rate of 16,000 Hz and work on audio clips of length 65,535 samples (≈ 4 seconds). For the spectrogram computation, we use a filter length of 1024, a hop length of 256 and a window size of 1024 in the short-time Fourier transform (STFT). We resize images extracted from the videos to 224-by-224 pixels. (Hedged sketches of this configuration follow the table.) |
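
The Experiment Setup row reports the optimizer, gradient clipping, and warm-up schedule in enough detail to reconstruct the training loop. Below is a minimal PyTorch sketch under those reported values; the model, loss, and data are placeholders of our own, not the paper's code, and the schedule function name is hypothetical.

```python
import torch

def lr_at_step(step: int) -> float:
    """Warm-up from 0 to 1e-3 over 5k steps, linear decay to 1e-4 at 100k steps, constant after."""
    if step < 5_000:
        return 0.001 * step / 5_000
    if step < 100_000:
        frac = (step - 5_000) / (100_000 - 5_000)
        return 0.001 + frac * (0.0001 - 0.001)
    return 0.0001

model = torch.nn.Linear(8, 8)  # stand-in for the actual CLIPSep model
optimizer = torch.optim.Adam(model.parameters(), lr=0.001,
                             betas=(0.9, 0.999), eps=1e-8)
# LambdaLR scales the base lr (0.001), so divide the target lr by it.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: lr_at_step(step) / 0.001)

for step in range(200_000):                    # 200,000 training steps
    batch = torch.randn(32, 8)                 # batch size 32 (dummy data)
    loss = model(batch).pow(2).mean()          # dummy loss in place of the separation loss
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip gradient norm to 1.0
    optimizer.step()
    scheduler.step()
```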
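
The same row also fixes the audio front end: 16 kHz audio, clips of 65,535 samples (about 4 seconds), and an STFT with filter length 1024, hop length 256, and window size 1024. The sketch below uses `torch.stft` with those values; the Hann window and the binary-mask rule (mark the bins where the target source dominates the rest of the mixture) are common choices we assume, not details confirmed by the excerpt above.

```python
import torch

SAMPLE_RATE = 16_000        # Hz
CLIP_LENGTH = 65_535        # samples, roughly 4 seconds; yields 256 STFT frames at hop 256
N_FFT = 1024                # filter length
HOP_LENGTH = 256
WIN_LENGTH = 1024

audio = torch.randn(CLIP_LENGTH)                 # stand-in for one mono clip
window = torch.hann_window(WIN_LENGTH)           # assumed window function
spec = torch.stft(audio, n_fft=N_FFT, hop_length=HOP_LENGTH,
                  win_length=WIN_LENGTH, window=window, return_complex=True)
mixture_mag = spec.abs()                         # (513, 256) magnitude spectrogram

# Assumed binary ground-truth mask: 1 where the target source's magnitude
# exceeds that of the residual (mixture minus target); the paper's exact rule may differ.
target_mag = torch.rand_like(mixture_mag) * mixture_mag   # stand-in source magnitude
binary_mask = (target_mag > (mixture_mag - target_mag)).float()
```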
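
The Software Dependencies row notes that SDR is computed with museval. A minimal usage sketch follows; the number of sources, array shapes, and the one-second evaluation window at 16 kHz are our assumptions, not settings taken from the paper.

```python
import numpy as np
import museval

sr = 16_000
n_samples = 4 * sr
references = np.random.randn(2, n_samples, 1)    # (n_sources, n_samples, n_channels)
estimates = references + 0.01 * np.random.randn(2, n_samples, 1)

# museval.evaluate returns framewise SDR, ISR, SIR, and SAR per source.
sdr, isr, sir, sar = museval.evaluate(references, estimates, win=sr, hop=sr)
print(sdr.mean(axis=1))                          # mean SDR for each source
```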