Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations

Authors: David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Aaron Courville, Christopher Pal

ICLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We perform an empirical investigation of various RNN regularizers, and find that zoneout gives significant performance improvements across tasks. We achieve competitive results with relatively simple models in character- and word-level language modelling on the Penn Treebank and Text8 datasets, and combining with recurrent batch normalization (Cooijmans et al., 2016) yields state-of-the-art results on permuted sequential MNIST."
Researcher Affiliation | Academia | "1 MILA, Université de Montréal, firstname.lastname@umontreal.ca. 2 École Polytechnique de Montréal, firstname.lastname@polymtl.ca."
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. (A hedged sketch of the zoneout update described in the paper is given after this table.)
Open Source Code | Yes | "Code for replicating all experiments can be found at: http://github.com/teganmaharaj/zoneout"
Open Datasets | Yes | "We evaluate zoneout's performance on the following tasks: (1) Character-level language modelling on the Penn Treebank corpus (Marcus et al., 1993); (3) Character-level language modelling on the Text8 corpus (Mahoney, 2011); (4) Classification of hand-written digits on permuted sequential MNIST (pMNIST) (Le et al., 2015)."
Dataset Splits | No | The paper reports validation metrics ("Validation BPC", "Validation error rates") and so uses a validation set, but it does not explicitly state the size, percentage, or methodology of the train/validation split.
Hardware Specification | No | The paper acknowledges "computing resources provided by Compute Canada and Calcul Quebec" but does not specify exact hardware details such as GPU or CPU models.
Software Dependencies | No | The paper mentions "Theano (Theano Development Team, 2016), Fuel, and Blocks (van Merriënboer et al., 2015)" but does not give exact version numbers for these software dependencies.
Experiment Setup | Yes | "For the character-level task, we train networks with one layer of 1000 hidden units. We train LSTMs with a learning rate of 0.002 on overlapping sequences of 100 in batches of 32, optimize using Adam, and clip gradients with threshold 1. ... For the word-level task ... 2 layers of 1500 units, with weights initialized uniformly [-0.04, +0.04]. The model is trained for 14 epochs with learning rate 1, after which the learning rate is reduced by a factor of 1.15 after each epoch. Gradient norms are clipped at 10." and "All models have a single layer of 100 units, and are trained for 150 epochs using RMSProp (Tieleman & Hinton, 2012) with a decay rate of 0.5 for the moving average of gradient norms. The learning rate is set to 0.001 and the gradients are clipped to a maximum norm of 1 (Pascanu et al., 2012)." (A sketch of the character-level training configuration is given after this table.)
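Since the paper provides no pseudocode block, here is a minimal sketch of the zoneout update it describes: at each timestep, every hidden unit keeps its previous value with some probability z instead of taking its newly computed value, and (as with dropout) the expectation of this stochastic mix is used at test time. This is an illustrative NumPy sketch, not the authors' Theano/Blocks implementation; the function name `zoneout`, the rate z = 0.15, and the toy tanh RNN cell are assumptions made only for the example.

```python
import numpy as np

def zoneout(h_prev, h_new, z=0.15, training=True, rng=np.random):
    """Zoneout update for one timestep.

    With probability z each hidden unit keeps its previous value h_prev
    instead of taking the newly computed value h_new; at test time the
    expectation of this stochastic mix is used instead.
    """
    if training:
        d = rng.binomial(1, z, size=h_prev.shape)  # 1 -> preserve the old unit
        return d * h_prev + (1 - d) * h_new
    return z * h_prev + (1 - z) * h_new

# Toy usage inside a vanilla tanh RNN step (weights are random placeholders).
rng = np.random.RandomState(0)
W, U, b = rng.randn(4, 4), rng.randn(4, 4), np.zeros(4)
h = np.zeros(4)
for x in rng.randn(10, 4):                 # a length-10 toy input sequence
    h_tilde = np.tanh(x @ W + h @ U + b)   # candidate hidden state
    h = zoneout(h, h_tilde, z=0.15, training=True, rng=rng)
```

In the paper's LSTM experiments, separate zoneout rates are applied to the cell state and the hidden state; the single rate z above is a simplification.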
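The character-level PTB settings quoted in the Experiment Setup row (one layer of 1000 hidden LSTM units, learning rate 0.002, sequences of length 100, batches of 32, Adam, gradient clipping at 1) could be wired up roughly as follows. This PyTorch sketch covers only the quoted optimization settings and does not implement the zoneout cell itself (which would require a custom recurrence rather than `nn.LSTM`); the vocabulary size, the embedding size, and the reading of "clip gradients with threshold 1" as global-norm clipping are assumptions.

```python
import torch
import torch.nn as nn

# Hyperparameters quoted in the Experiment Setup row (character-level PTB).
HIDDEN, SEQ_LEN, BATCH, LR, CLIP = 1000, 100, 32, 0.002, 1.0
VOCAB = 50  # assumption: the PTB character vocabulary size is not stated in the table

class CharLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)  # embedding size is an assumption
        self.lstm = nn.LSTM(HIDDEN, HIDDEN, num_layers=1, batch_first=True)
        self.out = nn.Linear(HIDDEN, VOCAB)

    def forward(self, x, state=None):
        h, state = self.lstm(self.embed(x), state)
        return self.out(h), state

model = CharLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=LR)  # "optimize using Adam"
criterion = nn.CrossEntropyLoss()

def train_step(x, y, state=None):
    """One update on a (BATCH, SEQ_LEN) block of character ids x with next-character targets y."""
    optimizer.zero_grad()
    logits, state = model(x, state)
    loss = criterion(logits.reshape(-1, VOCAB), y.reshape(-1))
    loss.backward()
    # "clip gradients with threshold 1" -- assumed here to mean clipping the global gradient norm
    torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP)
    optimizer.step()
    # detach the recurrent state so the next truncated-BPTT segment does not backprop through this one
    state = tuple(s.detach() for s in state)
    return loss.item(), state
```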