Adversarial Audio Synthesis
Authors: Chris Donahue, Julian McAuley, Miller Puckette
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we investigate both waveform and spectrogram strategies for generating one-second slices of audio with GANs. For our spectrogram approach (SpecGAN), we first design a spectrogram representation that allows for approximate inversion, and bootstrap the two-dimensional deep convolutional GAN (DCGAN) method (Radford et al., 2016) to operate on these spectrograms. In WaveGAN, our waveform approach, we flatten the DCGAN architecture to operate in one dimension, resulting in a model with the same number of parameters and numerical operations as its two-dimensional analog. With WaveGAN, we provide both a starting point for practical audio synthesis with GANs and a recipe for modifying other image generation methods to operate on waveforms. Our experiments demonstrate that without labels WaveGAN learns to produce intelligible words when trained on a small-vocabulary speech dataset, and can also synthesize audio from other domains such as drums, bird vocalizations, and piano. We compare WaveGAN to a method which applies GANs designed for image generation on image-like audio feature representations, finding both approaches to be promising. To facilitate human evaluation, our experimentation focuses on the Speech Commands Dataset (Warden, 2018). Results for our evaluation appear in Table 1. (An unofficial sketch of the one-dimensional DCGAN flattening and the phase-shuffle operation appears after this table.) |
| Researcher Affiliation | Academia | Chris Donahue, Department of Music, UC San Diego (cdonahue@ucsd.edu); Julian McAuley, Department of Computer Science, UC San Diego (jmcauley@eng.ucsd.edu); Miller Puckette, Department of Music, UC San Diego (msp@ucsd.edu) |
| Pseudocode | Yes | In Tables 2 and 3, we list the full architectures for our WaveGAN generator and discriminator respectively. In Tables 4 and 5, we list the same for SpecGAN. |
| Open Source Code | Yes | Training code: github.com/chrisdonahue/wavegan |
| Open Datasets | Yes | To facilitate human evaluation, our experimentation focuses on the Speech Commands Dataset (Warden, 2018). Other datasets: (1) Drum sound effects (0.7 hours): drum samples for kicks, snares, toms, and cymbals; (2) Bird vocalizations (12.2 hours): in-the-wild recordings of many species (Boesman, 2018); (3) Piano (0.3 hours): professional performer playing a variety of Bach compositions; (4) Large vocab speech (TIMIT) (2.4 hours): multiple speakers, clean (Garofolo et al., 1993) |
| Dataset Splits | Yes | We perform early stopping on the minimum negative log-likelihood of the validation set; the resultant model achieves 93% accuracy on the test set. |
| Hardware Specification | Yes | We train our networks using batches of size 64 on a single NVIDIA P100 GPU. |
| Software Dependencies | No | The paper discusses the use of WGAN-GP algorithm and references other GAN libraries, but it does not specify versions for software dependencies such as deep learning frameworks or specific libraries. |
| Experiment Setup | Yes | We train our networks using batches of size 64 on a single NVIDIA P100 GPU. During our quantitative evaluation of SC09 (discussed below), our WaveGAN networks converge by their early stopping criteria (inception score) within four days (200k iterations, around 3500 epochs), and produce speech-like audio within the first hour of training. Our SpecGAN networks converge more quickly, within two days (around 1750 epochs). On the other four datasets, we train WaveGAN for 200k iterations, representing nearly 1500 epochs for the largest dataset. Table 6 (WaveGAN hyperparameters): input data type: 16-bit PCM (requantized to 32-bit float); model data type: 32-bit floating point; num channels (c): 1; batch size (b): 64; model dimensionality (d): 64; phase shuffle (WaveGAN): 2; phase shuffle (SpecGAN): 0; loss: WGAN-GP (Gulrajani et al., 2017); WGAN-GP λ: 10; D updates per G update: 5; optimizer: Adam (α = 1e-4, β1 = 0.5, β2 = 0.9). (A sketch of this training configuration also appears after the table.) |
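
The excerpts above describe flattening DCGAN's two-dimensional 5×5, stride-2 transposed convolutions into one-dimensional length-25, stride-4 ones, and regularizing the discriminator with phase shuffle (n = 2 in Table 6). As a reading aid, here is a minimal, unofficial PyTorch sketch of both ideas; the released code is TensorFlow, and the layer sizes and names below (`UpsampleBlock1D`, `phase_shuffle`) are illustrative placeholders rather than the authors' implementation.

```python
# Unofficial sketch of two WaveGAN ideas: a 1-D "flattened" DCGAN upsampling
# block and the phase-shuffle regularizer. Shapes are hypothetical examples.
import torch
import torch.nn as nn
import torch.nn.functional as F


class UpsampleBlock1D(nn.Module):
    """DCGAN-style upsampling block flattened to one dimension:
    a 2-D 5x5 / stride-2 transposed conv becomes 1-D length-25 / stride-4."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.deconv = nn.ConvTranspose1d(in_ch, out_ch, kernel_size=25,
                                         stride=4, padding=11, output_padding=1)

    def forward(self, x):
        return F.relu(self.deconv(x))


def phase_shuffle(x, n=2):
    """Shift activations by a random offset in [-n, n] samples with reflection
    padding. Simplified: one shared shift per batch instead of per example."""
    shift = int(torch.randint(-n, n + 1, (1,)))
    if shift == 0:
        return x
    if shift > 0:
        return F.pad(x[:, :, shift:], (0, shift), mode="reflect")
    return F.pad(x[:, :, :shift], (-shift, 0), mode="reflect")


# Example: upsample a latent feature map, then phase-shuffle an activation.
z_feat = torch.randn(64, 512, 16)               # (batch, channels, time)
audio_feat = UpsampleBlock1D(512, 256)(z_feat)  # -> (64, 256, 64)
shuffled = phase_shuffle(audio_feat, n=2)
```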
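
Likewise, the Table 6 settings (WGAN-GP loss with λ = 10, five discriminator updates per generator update, batch size 64, Adam with α = 1e-4, β1 = 0.5, β2 = 0.9) can be read as the following hedged training-loop sketch. `G`, `D`, and `real_batch` are tiny stand-ins so the snippet runs end to end; they are not the WaveGAN architecture or data pipeline.

```python
# Unofficial sketch of the Table 6 training configuration (WGAN-GP).
import torch
import torch.nn as nn

LAMBDA, D_UPDATES, BATCH, LATENT_DIM = 10.0, 5, 64, 100

# Placeholder generator/discriminator and data loader, not the paper's models.
G = nn.Sequential(nn.Linear(LATENT_DIM, 16384), nn.Unflatten(1, (1, 16384)), nn.Tanh())
D = nn.Sequential(nn.Flatten(), nn.Linear(16384, 1))

def real_batch(n):
    # Stand-in for real one-second waveforms (e.g. SC09), scaled to [-1, 1].
    return torch.rand(n, 1, 16384) * 2 - 1

opt_g = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.5, 0.9))
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4, betas=(0.5, 0.9))

def gradient_penalty(D, real, fake):
    """WGAN-GP term: push the critic's gradient norm toward 1 on random
    interpolates between real and generated waveforms."""
    eps = torch.rand(real.size(0), 1, 1)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grad = torch.autograd.grad(D(interp).sum(), interp, create_graph=True)[0]
    return ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

for step in range(200_000):            # paper early-stops on inception score
    for _ in range(D_UPDATES):          # 5 critic updates per generator update
        real = real_batch(BATCH)
        fake = G(torch.randn(BATCH, LATENT_DIM)).detach()
        d_loss = D(fake).mean() - D(real).mean() + LAMBDA * gradient_penalty(D, real, fake)
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    fake = G(torch.randn(BATCH, LATENT_DIM))
    g_loss = -D(fake).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```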