Z-Forcing: Training Stochastic Recurrent Networks

Authors: Anirudh Goyal (alias Parth Goyal), Alessandro Sordoni, Marc-Alexandre Côté, Nan Rosemary Ke, Yoshua Bengio

NeurIPS 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we evaluate our proposed model on diverse modeling tasks (speech, images and text). We show that our model can achieve state-of-the-art results on two speech modeling datasets: Blizzard (King and Karaiskos, 2013) and TIMIT raw audio (also used in Chung et al. (2015)). Our approach also gives competitive results on sequential MNIST (Salakhutdinov and Murray, 2008).
Researcher Affiliation | Collaboration | Anirudh Goyal (MILA, Université de Montréal); Alessandro Sordoni (Microsoft Maluuba); Marc-Alexandre Côté (Microsoft Maluuba); Nan Rosemary Ke (MILA, Polytechnique Montréal); Yoshua Bengio (MILA, Université de Montréal).
Pseudocode | No | The paper describes the model and learning process using text and mathematical equations (e.g., Eqs. 4-10) but does not include any structured pseudocode or algorithm blocks.
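Since the learning procedure is conveyed only in prose and equations, a rough sketch of the objective's structure can stand in for the missing pseudocode. The snippet below is an assumption-laden paraphrase rather than the authors' algorithm: it only illustrates how a per-timestep reconstruction term, a KL term between an approximate posterior (conditioned on backward states) and a learned prior (conditioned on forward states), and an auxiliary backward-state prediction term could be combined into one training loss. The names gaussian_kl and z_forcing_style_loss, the array shapes, and the weights alpha and kl_temp are hypothetical placeholders:

    # Minimal NumPy sketch of a Z-Forcing-style objective; illustrative only.
    import numpy as np

    def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
        """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians, summed over the last axis."""
        return 0.5 * np.sum(
            logvar_p - logvar_q
            + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
            - 1.0,
            axis=-1,
        )

    def z_forcing_style_loss(log_px, mu_q, logvar_q, mu_p, logvar_p, log_pb,
                             alpha=1.0, kl_temp=1.0):
        """Negative per-sequence objective to minimize (hypothetical interface).

        log_px         : (T,)   decoder log-likelihood of each observation
        mu_q, logvar_q : (T, d) approximate posterior over z_t (uses backward states)
        mu_p, logvar_p : (T, d) prior over z_t (uses forward states)
        log_pb         : (T,)   auxiliary log-likelihood of the backward state given z_t
        alpha, kl_temp : auxiliary-cost weight and KL annealing temperature
        """
        kl = gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)   # shape (T,)
        elbo = np.sum(log_px) - kl_temp * np.sum(kl)
        return -(elbo + alpha * np.sum(log_pb))

    # Toy usage with random statistics, just to show the interface runs.
    T, d = 5, 3
    rng = np.random.RandomState(0)
    print(z_forcing_style_loss(
        log_px=rng.randn(T), mu_q=rng.randn(T, d), logvar_q=rng.randn(T, d),
        mu_p=rng.randn(T, d), logvar_p=rng.randn(T, d), log_pb=rng.randn(T),
        alpha=0.5, kl_temp=0.2))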
Open Source Code | No | The paper does not provide any explicit statement about releasing source code or a link to a code repository.
Open Datasets | Yes | Blizzard and TIMIT: We test our model on two speech modeling datasets. Blizzard consists of 300 hours of English, spoken by a single female speaker. TIMIT has been widely used in speech recognition and consists of 6300 English sentences read by 630 speakers. We adopt the same train, validation and test split as in Chung et al. (2015). For Blizzard, we report the average log-likelihood for half-second sequences (Fraccaro et al., 2016), while for TIMIT we report the average log-likelihood for the sequences in the test set. Sequential MNIST: The task consists of pixel-by-pixel generation of binarized MNIST digits; we use the standard binarized MNIST dataset used in Larochelle and Murray (2011). Text: We test our proposed stochastic recurrent model trained with the auxiliary cost on a medium-sized IMDB text corpus containing 350K movie reviews (Diao et al., 2014).
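On the sequential MNIST entry above, the pixel-by-pixel setup amounts to flattening each 28x28 binarized image into a length-784 binary sequence and predicting each pixel from the preceding ones. The snippet below is a hypothetical illustration using random binary images as stand-ins for the fixed binarization of Larochelle and Murray (2011):

    # Hypothetical illustration of forming pixel-by-pixel sequences from binarized MNIST.
    import numpy as np

    rng = np.random.RandomState(0)
    images = (rng.rand(32, 28, 28) > 0.5).astype(np.float32)  # toy stand-in batch

    # Raster-scan flattening: each image becomes a length-784 binary sequence.
    sequences = images.reshape(images.shape[0], -1)            # (batch, 784)

    # Teacher forcing: the model predicts pixel t+1 from pixels 1..t.
    inputs, targets = sequences[:, :-1], sequences[:, 1:]
    print(inputs.shape, targets.shape)                         # (32, 783) (32, 783)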
Dataset Splits | Yes | We adopt the same train, validation and test split as in Chung et al. (2015). We split the dataset into train/valid/test sets with ratios of 85%, 5% and 10%, respectively.
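A minimal sketch of the 85% / 5% / 10% split is given below; the permutation-based shuffling, the seed, and applying it to the 350K-review corpus are assumptions, since the paper does not state how the split was drawn:

    # Hypothetical 85/5/10 train/valid/test split by shuffled indices.
    import numpy as np

    def split_85_5_10(num_examples, seed=0):
        idx = np.random.RandomState(seed).permutation(num_examples)
        n_train = int(0.85 * num_examples)
        n_valid = int(0.05 * num_examples)
        return idx[:n_train], idx[n_train:n_train + n_valid], idx[n_train + n_valid:]

    train_idx, valid_idx, test_idx = split_85_5_10(350_000)
    print(len(train_idx), len(valid_idx), len(test_idx))  # 297500 17500 35000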
Hardware Specification | No | The paper acknowledges Compute Canada and NVIDIA for computing resources but does not provide specific hardware details such as the GPU or CPU models or the memory specifications used for the experiments.
Software Dependencies | No | The paper mentions Theano in the acknowledgements as a tool used, but does not provide a version number for Theano or any other software dependency relevant to reproducibility.
Experiment Setup | Yes | In all experiments, we used the ADAM optimizer (Kingma and Ba, 2014). Speech (Blizzard, TIMIT): Our forward/backward networks are LSTMs with 2048 recurrent units for Blizzard and 1024 recurrent units for TIMIT. The dimensionality of the Gaussian latent variables is 256. The prior f^(p), inference f^(q) and auxiliary f^(a) networks have a single hidden layer, with 1024 units for Blizzard and 512 units for TIMIT, and use leaky rectified nonlinearities with leakiness 1/3, clipped at 3 (Fraccaro et al., 2016). For Blizzard, we use a learning rate of 0.0003 and a batch size of 128; for TIMIT they are 0.001 and 32, respectively. When KL annealing is used, the temperature is linearly annealed from 0.2 to 1 after each update, with increments of 0.00005 (Fraccaro et al., 2016). MNIST: Both forward and backward networks are LSTMs with one layer of 1024 hidden units. We use a learning rate of 0.001 and a batch size of 32. Language modeling: We use a single-layer LSTM with 500 hidden recurrent units, fix the dimensionality of word embeddings to 300 and use 64-dimensional latent variables. All the f^(·) networks are single-layered with 500 hidden units and leaky ReLU activations. We use a learning rate of 0.001 and a batch size of 32.
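Two details in this setup are easy to get wrong when reproducing: the clipped leaky rectifier (leakiness 1/3, clipped at 3, following Fraccaro et al., 2016) and the linear KL temperature annealing from 0.2 to 1 in increments of 0.00005 per update. The sketch below shows one plausible reading of both; treating the clipping as symmetric around zero is an assumption, since the text only says "clipped at 3":

    # Illustrative helpers for two experiment-setup details; not the authors' code.
    import numpy as np

    def clipped_leaky_relu(x, leak=1.0 / 3.0, clip=3.0):
        """Leaky ReLU with slope `leak` for negative inputs; outputs clipped to [-clip, clip].
        Symmetric clipping is an assumption based on the phrase 'clipped at 3'."""
        return np.clip(np.where(x >= 0, x, leak * x), -clip, clip)

    def kl_temperature(update_step, start=0.2, end=1.0, increment=0.00005):
        """KL weight linearly annealed with the number of parameter updates."""
        return min(end, start + increment * update_step)

    print(clipped_leaky_relu(np.array([-6.0, -1.5, 0.5, 9.0])))            # roughly [-2., -0.5, 0.5, 3.]
    print(kl_temperature(0), kl_temperature(8000), kl_temperature(100000))  # ~0.2, ~0.6, 1.0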