A Recurrent Latent Variable Model for Sequential Data

Authors: Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C. Courville, Yoshua Bengio

NeurIPS 2015

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically evaluate the proposed model against other related sequential models on four speech datasets and one handwriting dataset. Our results show the important roles that latent random variables can play in the RNN dynamics. ... We evaluate the proposed VRNN model on two tasks: (1) modelling natural speech directly from the raw audio waveforms; (2) modelling handwriting generation.
Researcher Affiliation | Academia | Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron Courville, Yoshua Bengio. Department of Computer Science and Operations Research, Université de Montréal; CIFAR Senior Fellow. {firstname.lastname}@umontreal.ca
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at http://www.github.com/jych/nips2015_vrnn
Open Datasets | Yes | We evaluate the models on the following four speech datasets: 1. Blizzard: This text-to-speech dataset, made available by the Blizzard Challenge 2013, contains 300 hours of English spoken by a single female speaker [10]. 2. TIMIT: This widely used dataset for benchmarking speech recognition systems contains 6,300 English sentences read by 630 speakers. 3. Onomatopoeia: This is a set of 6,738 non-linguistic human-made sounds such as coughing, screaming, laughing and shouting, recorded from 51 voice actors. 4. Accent: This dataset contains English paragraphs read by 2,046 different native and non-native English speakers [19]. ... Handwriting generation: We let each model learn a sequence of (x, y) coordinates together with binary indicators of pen-up/pen-down, using the IAM-OnDB dataset, which consists of 13,040 handwritten lines written by 500 writers [14].
Dataset Splits | Yes | Except the TIMIT dataset, the rest of the datasets do not have predefined train/test splits. We shuffle and divide the data into train/validation/test splits using a ratio of 0.9/0.05/0.05. ... The final model was chosen with early-stopping based on the validation performance.
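To make the split protocol concrete, here is a minimal Python sketch of the shuffle-and-divide step described above. This is not the authors' code; the function name, the in-memory list of sequences, and the random seed are illustrative assumptions.

    import numpy as np

    def split_sequences(sequences, seed=1234):
        # Shuffle once, then carve out 90% / 5% / 5% for train/valid/test,
        # mirroring the 0.9/0.05/0.05 ratio quoted above.
        # The seed and data structure are assumptions for illustration.
        rng = np.random.RandomState(seed)
        order = rng.permutation(len(sequences))
        n_train = int(0.90 * len(sequences))
        n_valid = int(0.05 * len(sequences))
        train = [sequences[i] for i in order[:n_train]]
        valid = [sequences[i] for i in order[n_train:n_train + n_valid]]
        test = [sequences[i] for i in order[n_train + n_valid:]]
        return train, valid, test

TIMIT would keep its predefined train/test split and skip this step, with early stopping still driven by validation performance.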
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. It only mentions general setup parameters.
Software Dependencies | No | The paper mentions the optimizer used for training and the public code release, but it does not specify the software libraries, frameworks, or version numbers required to run the experiments.
Experiment Setup | Yes | The only preprocessing used in our experiments is normalizing each sequence using the global mean and standard deviation computed from the entire training set. We train each model with stochastic gradient descent on the negative log-likelihood using the Adam optimizer [12], with a learning rate of 0.001 for TIMIT and Accent and 0.0003 for the rest. We use a minibatch size of 128 for Blizzard and Accent and 64 for the rest. ... We fix each model to have a single recurrent hidden layer with 2000 LSTM units (in the case of Blizzard, 4000 and for IAM-OnDB, 1200). All of ϕτ shown in Eqs. (5)–(7), (9) have four hidden layers using rectified linear units [15] (for IAM-OnDB, we use a single hidden layer). ... Note that we use 20 mixture components for models using a GMM as the output function.
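For readers reconstructing this setup, the per-dataset hyperparameters quoted above can be gathered in one place. The sketch below only restates the numbers reported in the paper; the dictionary layout, names, and the normalization helper are illustrative assumptions rather than the authors' code.

    import numpy as np

    # Learning rate, minibatch size, and LSTM units per dataset, as reported.
    HYPERPARAMS = {
        "Blizzard":     {"lr": 0.0003, "batch_size": 128, "lstm_units": 4000},
        "TIMIT":        {"lr": 0.001,  "batch_size": 64,  "lstm_units": 2000},
        "Onomatopoeia": {"lr": 0.0003, "batch_size": 64,  "lstm_units": 2000},
        "Accent":       {"lr": 0.001,  "batch_size": 128, "lstm_units": 2000},
        "IAM-OnDB":     {"lr": 0.0003, "batch_size": 64,  "lstm_units": 1200},
    }
    N_GMM_COMPONENTS = 20  # for models using a GMM output function

    def normalize(train_seqs, other_seqs):
        # Global mean/std are computed from the training set only and then
        # applied to every split, as in the quoted preprocessing step.
        # The small epsilon guarding against division by zero is an assumption.
        flat = np.concatenate([np.asarray(s).ravel() for s in train_seqs])
        mean, std = flat.mean(), flat.std() + 1e-8
        scale = lambda seqs: [(np.asarray(s) - mean) / std for s in seqs]
        return scale(train_seqs), scale(other_seqs)

The quoted setup does not say whether normalization uses a single scalar statistic or per-feature statistics; the scalar version above is one reading, which coincides with the per-feature case for raw 1-D waveforms.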