SampleRNN: An Unconditional End-to-End Neural Audio Generation Model

Authors: Soroush Mehri, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, Yoshua Bengio

ICLR 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper we propose a novel model for unconditional audio generation based on generating one audio sample at a time. We show that our model, which profits from combining memory-less modules, namely autoregressive multilayer perceptrons, and stateful recurrent neural networks in a hierarchical structure, is able to capture underlying sources of variation in temporal sequences over very long time spans, on three datasets of different nature. Human evaluation on the generated samples indicates that our model is preferred over competing models. We also show how each component of the model contributes to the exhibited performance.
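The hierarchy the abstract describes pairs a stateful frame-level recurrent tier with a memory-less sample-level MLP. Below is a minimal sketch of that two-tier arrangement, written in PyTorch for illustration; the released code uses Theano, and all module names, shapes, and sizes here are assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class TwoTierSampleRNNSketch(nn.Module):
    """Illustrative two-tier hierarchy: a stateful GRU over frames of
    samples conditions a memory-less MLP that predicts one sample at a time."""

    def __init__(self, q_levels=256, frame_size=2, hidden=1024):
        super().__init__()
        self.frame_size = frame_size
        # Frame-level tier: a recurrent module summarizing past frames.
        self.frame_rnn = nn.GRU(frame_size, hidden, batch_first=True)
        # Sample-level tier: embed discrete samples, then a small MLP.
        self.embed = nn.Embedding(q_levels, 256)
        self.mlp = nn.Sequential(
            nn.Linear(frame_size * 256 + hidden, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, q_levels),  # logits over the quantized levels
        )

    def forward(self, frames, prev_samples, h=None):
        # frames: (batch, n_frames, frame_size) real-valued frame inputs
        # prev_samples: (batch, frame_size) most recent quantized samples
        cond, h = self.frame_rnn(frames, h)        # frame-level context
        emb = self.embed(prev_samples).flatten(1)  # (batch, frame_size * 256)
        logits = self.mlp(torch.cat([emb, cond[:, -1]], dim=1))
        return logits, h  # distribution over the next audio sample
```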
Researcher Affiliation | Academia | Soroush Mehri (University of Montreal); Kundan Kumar (IIT Kanpur); Ishaan Gulrajani (University of Montreal); Rithesh Kumar (SSNCE); Shubham Jain (IIT Kanpur); Jose Sotelo (University of Montreal); Aaron Courville (University of Montreal, CIFAR Fellow); Yoshua Bengio (University of Montreal, CIFAR Senior Fellow)
Pseudocode | No | The paper describes the model using mathematical equations and textual explanations, but it does not include pseudocode or a clearly labeled algorithm block.
Open Source Code | Yes | Code: https://github.com/soroushmehr/sampleRNN_ICLR2017
Open Datasets | Yes | Blizzard, a dataset presented by Prahallad et al. (2013) for the speech synthesis task, contains 315 hours of a single female voice actor in English; however, for our experiments we are using only 20.5 hours. The Music dataset is the collection of all 32 of Beethoven's piano sonatas publicly available on https://archive.org/, amounting to 10 hours of non-vocal audio.
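The model treats each audio sample as one of 256 discrete values, so the raw waveforms from these datasets must be quantized before training. A minimal sketch of linear 8-bit quantization follows; the function names and the exact rounding convention are illustrative assumptions, and the repository's preprocessing may differ in detail.

```python
import numpy as np

def quantize_linear(audio, q_levels=256):
    """Map a float waveform in [-1, 1] to integer levels {0, ..., q_levels - 1}."""
    audio = np.clip(audio, -1.0, 1.0)
    # Scale to [0, q_levels - 1] and round to the nearest level.
    return ((audio + 1.0) / 2.0 * (q_levels - 1)).round().astype(np.int64)

def dequantize_linear(levels, q_levels=256):
    """Inverse map from integer levels back to floats in [-1, 1]."""
    return levels.astype(np.float32) / (q_levels - 1) * 2.0 - 1.0
```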
Dataset Splits | Yes | Blizzard: the training/validation/test split is 86%-7%-7%. Onomatopoeia, a relatively small dataset with 6,738 sequences adding up to 3.5 hours... The training/validation/test split is 92%-4%-4%. Music dataset... The training/validation/test split is 88%-6%-6%.
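Since the splits are given as percentages of each corpus, they can be reproduced with a simple deterministic partition. A sketch is below; the paper does not specify how sequences were ordered or shuffled before splitting, so the contiguous partition here is an assumption.

```python
def split_dataset(sequences, train_frac=0.86, valid_frac=0.07):
    """Partition a list of sequences into train/valid/test subsets,
    e.g. 86%-7%-7% for Blizzard as reported in the paper."""
    n = len(sequences)
    n_train = int(n * train_frac)
    n_valid = int(n * valid_frac)
    train = sequences[:n_train]
    valid = sequences[n_train:n_train + n_valid]
    test = sequences[n_train + n_valid:]
    return train, valid, test
```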
Hardware Specification | Yes | We trained these models for about one week on a GeForce GTX TITAN X.
Software Dependencies | No | The paper mentions the use of the Adam optimizer, Weight Normalization, and Theano (with a reference to 'Theano Development Team (2016)'), but it does not provide specific version numbers for any software libraries or dependencies used in the experiments.
Experiment Setup | Yes | All the models have been trained with teacher forcing and stochastic gradient descent (mini-batch size 128) to minimize the Negative Log-Likelihood (NLL) in bits per dimension (per audio sample). Gradients were hard-clipped to remain in the [-1, 1] range. Update rules from the Adam optimizer (Kingma & Ba, 2014) (β1 = 0.9, β2 = 0.999, and ϵ = 1e-8) with an initial learning rate of 0.001 were used to adjust the parameters. For training each model, random search over hyper-parameter values (Bergstra & Bengio, 2012) was conducted. The size of the embedding layer was 256, initialized from a standard normal distribution. Orthogonal weight matrices were used for hidden-to-hidden connections, and other weight matrices were initialized similarly to He et al. (2015). 1024 was the number of hidden units for all GRUs (1 layer per tier for the 3-tier model and 3 layers for the 2-tier model) and MLPs (3 fully connected layers with ReLU activation, with output dimension 1024 for the first two layers and 256 for the final layer before softmax). Also FS(1) = FS(2) = 2 and FS(3) = 8 were found to result in the lowest NLL.
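Most of these settings map directly onto standard optimizer configuration. A hedged sketch of one teacher-forced training step in PyTorch follows: the Adam hyper-parameters, mini-batch size, and hard gradient clipping follow the quoted text, while `model` stands in for any autoregressive network producing logits over quantized levels (such as the sketch above) and is otherwise an assumption.

```python
import math
import torch
import torch.nn.functional as F

# Adam with the quoted hyper-parameters and initial learning rate.
optimizer = torch.optim.Adam(
    model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-8)

def training_step(batch_inputs, batch_targets):
    """One teacher-forced step on a mini-batch (the paper uses 128 sequences)."""
    logits = model(batch_inputs)                  # (batch, q_levels)
    nll = F.cross_entropy(logits, batch_targets)  # NLL in nats
    nll_bits = nll / math.log(2)                  # reported as bits per sample
    optimizer.zero_grad()
    nll.backward()
    # Hard-clip each gradient element to the [-1, 1] range.
    for p in model.parameters():
        if p.grad is not None:
            p.grad.clamp_(-1.0, 1.0)
    optimizer.step()
    return nll_bits.item()
```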