Adversarial Audio Synthesis
Authors: Chris Donahue, Julian McAuley, Miller Puckette
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we investigate both waveform and spectrogram strategies for generating one-second slices of audio with GANs. For our spectrogram approach (SpecGAN), we first design a spectrogram representation that allows for approximate inversion, and bootstrap the two-dimensional deep convolutional GAN (DCGAN) method (Radford et al., 2016) to operate on these spectrograms. In WaveGAN, our waveform approach, we flatten the DCGAN architecture to operate in one dimension, resulting in a model with the same number of parameters and numerical operations as its two-dimensional analog. With WaveGAN, we provide both a starting point for practical audio synthesis with GANs and a recipe for modifying other image generation methods to operate on waveforms. Our experiments demonstrate that without labels WaveGAN learns to produce intelligible words when trained on a small-vocabulary speech dataset, and can also synthesize audio from other domains such as drums, bird vocalizations, and piano. We compare WaveGAN to a method which applies GANs designed for image generation on image-like audio feature representations, finding both approaches to be promising. To facilitate human evaluation, our experimentation focuses on the Speech Commands Dataset (Warden, 2018). Results for our evaluation appear in Table 1. (An unofficial sketch of the one-dimensional DCGAN flattening and the phase-shuffle operation appears after this table.) |
| Researcher Affiliation | Academia | Chris Donahue, Department of Music, UC San Diego (cdonahue@ucsd.edu); Julian McAuley, Department of Computer Science, UC San Diego (jmcauley@eng.ucsd.edu); Miller Puckette, Department of Music, UC San Diego (msp@ucsd.edu) |
| Pseudocode | Yes | In Tables 2 and 3, we list the full architectures for our WaveGAN generator and discriminator respectively. In Tables 4 and 5, we list the same for SpecGAN. |
| Open Source Code | Yes | Training code: github.com/chrisdonahue/wavegan |
| Open Datasets | Yes | To facilitate human evaluation, our experimentation focuses on the Speech Commands Dataset (Warden, 2018). Other datasets: (1) Drum sound effects (0.7 hours): drum samples for kicks, snares, toms, and cymbals; (2) Bird vocalizations (12.2 hours): in-the-wild recordings of many species (Boesman, 2018); (3) Piano (0.3 hours): professional performer playing a variety of Bach compositions; (4) Large vocab speech (TIMIT) (2.4 hours): multiple speakers, clean (Garofolo et al., 1993) |
| Dataset Splits | Yes | We perform early stopping on the minimum negative log-likelihood of the validation set; the resultant model achieves 93% accuracy on the test set. |
| Hardware Specification | Yes | We train our networks using batches of size 64 on a single NVIDIA P100 GPU. |
| Software Dependencies | No | The paper discusses the use of WGAN-GP algorithm and references other GAN libraries, but it does not specify versions for software dependencies such as deep learning frameworks or specific libraries. |
| Experiment Setup | Yes | We train our networks using batches of size 64 on a single NVIDIA P100 GPU. During our quantitative evaluation of SC09 (discussed below), our WaveGAN networks converge by their early stopping criteria (inception score) within four days (200k iterations, around 3500 epochs), and produce speech-like audio within the first hour of training. Our SpecGAN networks converge more quickly, within two days (around 1750 epochs). On the other four datasets, we train WaveGAN for 200k iterations, representing nearly 1500 epochs for the largest dataset. Table 6 (WaveGAN hyperparameters): input data type: 16-bit PCM (requantized to 32-bit float); model data type: 32-bit floating point; num channels (c): 1; batch size (b): 64; model dimensionality (d): 64; phase shuffle (WaveGAN): 2; phase shuffle (SpecGAN): 0; loss: WGAN-GP (Gulrajani et al., 2017); WGAN-GP λ: 10; D updates per G update: 5; optimizer: Adam (α = 1e-4, β1 = 0.5, β2 = 0.9). (A sketch of this training configuration also appears after the table.) |
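
The excerpts above describe flattening DCGAN's two-dimensional 5×5, stride-2 transposed convolutions into one-dimensional length-25, stride-4 ones, and regularizing the discriminator with phase shuffle (n = 2 in Table 6). As a reading aid, here is a minimal, unofficial PyTorch sketch of both ideas; the released code is TensorFlow, and the layer sizes and names below (`UpsampleBlock1D`, `phase_shuffle`) are illustrative placeholders rather than the authors' implementation.

```python
# Unofficial sketch of two WaveGAN ideas: a 1-D "flattened" DCGAN upsampling
# block and the phase-shuffle regularizer. Shapes are hypothetical examples.
import torch
import torch.nn as nn
import torch.nn.functional as F


class UpsampleBlock1D(nn.Module):
    """DCGAN-style upsampling block flattened to one dimension:
    a 2-D 5x5 / stride-2 transposed conv becomes 1-D length-25 / stride-4."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.deconv = nn.ConvTranspose1d(in_ch, out_ch, kernel_size=25,
                                         stride=4, padding=11, output_padding=1)

    def forward(self, x):
        return F.relu(self.deconv(x))


def phase_shuffle(x, n=2):
    """Shift activations by a random offset in [-n, n] samples with reflection
    padding. Simplified: one shared shift per batch instead of per example."""
    shift = int(torch.randint(-n, n + 1, (1,)))
    if shift == 0:
        return x
    if shift > 0:
        return F.pad(x[:, :, shift:], (0, shift), mode="reflect")
    return F.pad(x[:, :, :shift], (-shift, 0), mode="reflect")


# Example: upsample a latent feature map, then phase-shuffle an activation.
z_feat = torch.randn(64, 512, 16)               # (batch, channels, time)
audio_feat = UpsampleBlock1D(512, 256)(z_feat)  # -> (64, 256, 64)
shuffled = phase_shuffle(audio_feat, n=2)
```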
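
Likewise, the Table 6 settings (WGAN-GP loss with λ = 10, five discriminator updates per generator update, batch size 64, Adam with α = 1e-4, β1 = 0.5, β2 = 0.9) can be read as the following hedged training-loop sketch. `G`, `D`, and `real_batch` are tiny stand-ins so the snippet runs end to end; they are not the WaveGAN architecture or data pipeline.

```python
# Unofficial sketch of the Table 6 training configuration (WGAN-GP).
import torch
import torch.nn as nn

LAMBDA, D_UPDATES, BATCH, LATENT_DIM = 10.0, 5, 64, 100

# Placeholder generator/discriminator and data loader, not the paper's models.
G = nn.Sequential(nn.Linear(LATENT_DIM, 16384), nn.Unflatten(1, (1, 16384)), nn.Tanh())
D = nn.Sequential(nn.Flatten(), nn.Linear(16384, 1))

def real_batch(n):
    # Stand-in for real one-second waveforms (e.g. SC09), scaled to [-1, 1].
    return torch.rand(n, 1, 16384) * 2 - 1

opt_g = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.5, 0.9))
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4, betas=(0.5, 0.9))

def gradient_penalty(D, real, fake):
    """WGAN-GP term: push the critic's gradient norm toward 1 on random
    interpolates between real and generated waveforms."""
    eps = torch.rand(real.size(0), 1, 1)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grad = torch.autograd.grad(D(interp).sum(), interp, create_graph=True)[0]
    return ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

for step in range(200_000):            # paper early-stops on inception score
    for _ in range(D_UPDATES):          # 5 critic updates per generator update
        real = real_batch(BATCH)
        fake = G(torch.randn(BATCH, LATENT_DIM)).detach()
        d_loss = D(fake).mean() - D(real).mean() + LAMBDA * gradient_penalty(D, real, fake)
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    fake = G(torch.randn(BATCH, LATENT_DIM))
    g_loss = -D(fake).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```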