GANSynth: Adversarial Neural Audio Synthesis
Authors: Jesse Engel, Kumar Krishna Agrawal, Shuo Chen, Ishaan Gulrajani, Chris Donahue, Adam Roberts
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive empirical investigations on the NSynth dataset, we demonstrate that GANs are able to outperform strong WaveNet baselines on automated and human evaluation metrics, and efficiently generate audio several orders of magnitude faster than their autoregressive counterparts. |
| Researcher Affiliation | Industry | Jesse Engel, Kumar Krishna Agrawal, Shuo Chen, Ishaan Gulrajani, Chris Donahue, & Adam Roberts, Google AI, Mountain View, CA 94043, USA |
| Pseudocode | No | The paper does not contain pseudocode or a clearly labeled algorithm block. |
| Open Source Code | Yes | Online resources: Colab Notebook: http://goo.gl/magenta/gansynth-demo, Audio Examples: http://goo.gl/magenta/gansynth-examples, Code: http://goo.gl/magenta/gansynth-code |
| Open Datasets | Yes | We focus our study on the NSynth dataset, which contains 300,000 musical notes from 1,000 different instruments aligned and recorded in isolation. NSynth is a difficult dataset composed of highly diverse timbres and pitches, but it is also highly structured with labels for pitch, velocity, instrument, and acoustic qualities (Liu et al., 2015; Engel et al., 2017). Each sample is four seconds long and sampled at 16 kHz, giving 64,000 dimensions. As we wanted to include human evaluations of audio quality, we restricted ourselves to training on the subset of acoustic instruments and fundamental pitches ranging from MIDI 24-84 (~32-1000 Hz), as those timbres are most likely to sound natural to an average listener (see the MIDI-to-Hz sketch after the table). This left us with 70,379 examples from instruments that are mostly strings, brass, woodwinds, and mallets. We created a new 80/20 train/test split from shuffled data, as the original split was divided along instrument type, which isn't desirable for this task. Dataset: https://magenta.tensorflow.org/datasets/nsynth |
| Dataset Splits | Yes | We created a new 80/20 train/test split from shuffled data, as the original split was divided along instrument type, which isn't desirable for this task (see the re-split sketch after the table). |
| Hardware Specification | Yes | We train each GAN variant for 4.5 days on a single V100 GPU, with a batch size of 8. |
| Software Dependencies | No | The paper mentions TensorFlow, the ADAM optimizer (Kingma & Ba, 2014), mu-law encoding, and a mixture of 10 logistics (Salimans et al., 2017), but does not provide specific version numbers for any of these software components. |
| Experiment Setup | Yes | All models were trained with the ADAM optimizer (Kingma & Ba, 2014). We sweep over learning rates (2e-4, 4e-4, 8e-4) and weights of the auxiliary classifier loss (0.1, 1.0, 10), and find that for all variants (spectral representation, progressive/no progressive, frequency resolution) a learning rate of 8e-4 and a classifier loss weight of 10 perform best. We train each GAN variant for 4.5 days on a single V100 GPU, with a batch size of 8. For non-progressive models, this equates to training on 5M examples. For progressive models, we train on 1.6M examples per stage (7 stages): 800k during alpha blending and 800k after blending (see the sweep and schedule sketches after the table). |
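
The pitch range quoted in the dataset row, MIDI 24-84, maps to frequency through the standard equal-temperament relation f = 440 * 2^((m - 69)/12). A minimal sketch checking the paper's ~32-1000 Hz figure (plain Python, independent of the released code):

```python
def midi_to_hz(midi_note: int) -> float:
    """Equal-temperament pitch, with A4 (MIDI 69) tuned to 440 Hz."""
    return 440.0 * 2.0 ** ((midi_note - 69) / 12.0)

# MIDI 24 (C1) ~= 32.7 Hz and MIDI 84 (C6) ~= 1046.5 Hz,
# consistent with the ~32-1000 Hz range quoted in the paper.
print(midi_to_hz(24))  # 32.70...
print(midi_to_hz(84))  # 1046.50...
```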
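The authors discard NSynth's official instrument-disjoint split and re-split 80/20 from shuffled data. A minimal sketch of such a re-split, assuming the `examples.json` metadata file from the public NSynth release; the fixed seed and file layout are illustrative, not from the paper:

```python
import json
import random

# Assumption: note IDs are the keys of examples.json, as in the public
# NSynth release; the 80/20 ratio and shuffling come from the paper.
with open("examples.json") as f:
    note_ids = sorted(json.load(f).keys())

rng = random.Random(0)  # illustrative fixed seed for reproducibility
rng.shuffle(note_ids)

cut = int(0.8 * len(note_ids))
train_ids, test_ids = note_ids[:cut], note_ids[cut:]
print(len(train_ids), len(test_ids))
```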
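The hyperparameter search described in the setup row is a plain 3x3 grid over learning rate and classifier-loss weight. A sketch of enumerating it; the config dict and launcher are hypothetical:

```python
import itertools

learning_rates = (2e-4, 4e-4, 8e-4)         # values from the paper
classifier_loss_weights = (0.1, 1.0, 10.0)  # values from the paper

for lr, cls_w in itertools.product(learning_rates, classifier_loss_weights):
    config = {"learning_rate": lr, "classifier_loss_weight": cls_w}
    # launch_training(config)  # hypothetical launcher, one run per grid cell
    print(config)
```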
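The progressive schedule implies a per-stage fade-in coefficient that ramps from 0 to 1 over the first 800k examples and then holds at 1 for another 800k. A sketch of that schedule and the step counts the paper's numbers imply; the function name and linear ramp are assumptions consistent with standard progressive-GAN training, not taken from the released code:

```python
EXAMPLES_PER_PHASE = 800_000  # 800k blending + 800k stabilizing, per the paper
BATCH_SIZE = 8                # per the paper
NUM_STAGES = 7                # per the paper

def blend_alpha(examples_seen_in_stage: int) -> float:
    """Fade-in weight for the newest resolution block (ramps 0 -> 1, then held)."""
    return min(1.0, examples_seen_in_stage / EXAMPLES_PER_PHASE)

# Rough bookkeeping implied by the quoted numbers:
steps_per_stage = 2 * EXAMPLES_PER_PHASE // BATCH_SIZE  # 200,000 steps per stage
total_examples = NUM_STAGES * 2 * EXAMPLES_PER_PHASE    # 11.2M examples overall
```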