Catch-A-Waveform: Learning to Generate Audio from a Single Short Example

Authors: Gal Greshler, Tamar Rott Shaham, Tomer Michaeli

NeurIPS 2021

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "We test our Catch-A-Waveform (CAW) method in several applications and evaluate it both qualitatively and quantitatively. Our training examples contain a variety of audio types, including polyphonic rock and pop music, monophonic instrumental music, speech, and ambient sounds." |
| Researcher Affiliation | Academia | Gal Greshler (galgreshler@gmail.com), Tamar Rott Shaham (stamarot@campus.technion.ac.il), and Tomer Michaeli (tomer.m@ee.technion.ac.il), all with the Technion, Israel Institute of Technology. |
| Pseudocode | No | The paper describes the model in detail and includes figures, but provides no formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | "Code is available at https://github.com/galgreshler/Catch-A-Waveform" |
| Open Datasets | Yes | "We compare our BE [bandwidth extension] results to the state-of-the-art temporal FiLM (TFiLM) method [6], which requires a large training set to perform this task. We use the VCTK dataset, and report both the signal-to-noise ratio (SNR) and the log-spectral distance (LSD) [17] between the recovered signal and the ground-truth one, averaged over a test set." (A sketch of these two metrics appears below the table.) |
| Dataset Splits | No | The paper mentions training on short signals (e.g., 20-25 seconds) and testing on held-out data or specific test sets (e.g., "held-out sentences of the same speaker" for VCTK), but does not provide specific percentages, sample counts, or references to predefined train/validation/test splits for its experiments. |
| Hardware Specification | Yes | "Training on a 25 second long signal takes about 10 hours on Nvidia GeForce RTX 2080." |
| Software Dependencies | No | The paper mentions using the Adam optimizer, but does not provide version numbers for software dependencies or libraries. |
| Experiment Setup | Yes | "For training, we use the Adam optimizer [28] with (β1, β2) = (0.5, 0.999) and learning rate 0.0015, which we reduce by a factor of 10 after two thirds of the epochs (we run a total of 3000 epochs)." (An optimizer sketch appears below the table.) |
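
The Open Datasets row cites two waveform metrics, SNR and log-spectral distance (LSD). For reference, here is a minimal NumPy/SciPy sketch of one common definition of each; the STFT frame length, sample rate default, and epsilon floor are assumptions, since the excerpt does not state the paper's exact parameters.

```python
import numpy as np
from scipy.signal import stft

def snr_db(ref, est):
    """SNR in dB between a ground-truth signal and a recovered one."""
    noise_power = np.sum((ref - est) ** 2)
    return 10.0 * np.log10(np.sum(ref ** 2) / noise_power)

def lsd(ref, est, fs=16000, nperseg=2048):
    """Log-spectral distance: RMS difference of log power spectra per
    STFT frame, averaged over frames (frame size is an assumption)."""
    _, _, R = stft(ref, fs=fs, nperseg=nperseg)
    _, _, E = stft(est, fs=fs, nperseg=nperseg)
    eps = 1e-10  # numerical floor, not from the paper
    diff = np.log10(np.abs(R) ** 2 + eps) - np.log10(np.abs(E) ** 2 + eps)
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=0))))
```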
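The Experiment Setup row fully specifies the optimizer configuration. Below is a minimal PyTorch sketch of that schedule; the one-layer model, dummy input, and squared-output loss are placeholders standing in for CAW's multi-scale GAN, which the excerpt does not describe.

```python
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import MultiStepLR

TOTAL_EPOCHS = 3000  # "we run a total of 3000 epochs"

model = nn.Conv1d(1, 16, kernel_size=9, padding=4)  # placeholder, not CAW's generator
x = torch.randn(1, 1, 16000)                        # stand-in for one audio example

# Adam with (beta1, beta2) = (0.5, 0.999) and learning rate 0.0015.
optimizer = optim.Adam(model.parameters(), lr=0.0015, betas=(0.5, 0.999))
# Drop the learning rate by a factor of 10 after two thirds of the epochs.
scheduler = MultiStepLR(optimizer, milestones=[2 * TOTAL_EPOCHS // 3], gamma=0.1)

for epoch in range(TOTAL_EPOCHS):
    optimizer.zero_grad()
    loss = model(x).pow(2).mean()  # placeholder loss, not the paper's adversarial loss
    loss.backward()
    optimizer.step()
    scheduler.step()
```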