Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations
Authors: David Krueger, Tegan Maharaj, Janos Kramar, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Aaron Courville, Christopher Pal
ICLR 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform an empirical investigation of various RNN regularizers, and find that zoneout gives significant performance improvements across tasks. We achieve competitive results with relatively simple models in character- and word-level language modelling on the Penn Treebank and Text8 datasets, and combining with recurrent batch normalization (Cooijmans et al., 2016) yields state-of-the-art results on permuted sequential MNIST. |
| Researcher Affiliation | Academia | 1 MILA, Université de Montréal, firstname.lastname@umontreal.ca. 2 École Polytechnique de Montréal, firstname.lastname@polymtl.ca. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code for replicating all experiments can be found at: http://github.com/teganmaharaj/zoneout |
| Open Datasets | Yes | We evaluate zoneout's performance on the following tasks: (1) Character-level language modelling on the Penn Treebank corpus (Marcus et al., 1993); (3) Character-level language modelling on the Text8 corpus (Mahoney, 2011); (4) Classification of hand-written digits on permuted sequential MNIST (pMNIST) (Le et al., 2015). |
| Dataset Splits | No | The paper mentions "Validation BPC" and "Validation error rates" and uses a validation set for metrics, but does not explicitly provide the specific size, percentage, or methodology for the train/validation split. |
| Hardware Specification | No | The paper mentions "computing resources provided by Compute Canada and Calcul Quebec" but does not specify exact hardware details such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions "Theano (Theano Development Team, 2016), Fuel, and Blocks (van Merriënboer et al., 2015)" but does not specify exact version numbers for these software dependencies. |
| Experiment Setup | Yes | "For the character-level task, we train networks with one layer of 1000 hidden units. We train LSTMs with a learning rate of 0.002 on overlapping sequences of 100 in batches of 32, optimize using Adam, and clip gradients with threshold 1." ... "For the word-level task ... 2 layers of 1500 units, with weights initialized uniformly [-0.04, +0.04]. The model is trained for 14 epochs with learning rate 1, after which the learning rate is reduced by a factor of 1.15 after each epoch. Gradient norms are clipped at 10." ... "All models have a single layer of 100 units, and are trained for 150 epochs using RMSProp (Tieleman & Hinton, 2012) with a decay rate of 0.5 for the moving average of gradient norms. The learning rate is set to 0.001 and the gradients are clipped to a maximum norm of 1 (Pascanu et al., 2012)." |
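
The Experiment Setup row quotes three training configurations from the paper. As a quick reference, they can be restated as a plain Python mapping; this is an illustrative sketch (the key names are ours, not from the released code), with values taken directly from the quoted text.

```python
# Hedged restatement of the hyperparameters quoted in the Experiment Setup row.
# Structure and key names are illustrative; values are as reported in the paper.
EXPERIMENT_SETUPS = {
    "char_ptb": {                     # character-level Penn Treebank
        "layers": 1, "hidden_units": 1000,
        "optimizer": "Adam", "learning_rate": 0.002,
        "sequence_length": 100, "batch_size": 32,
        "gradient_clip_threshold": 1,
    },
    "word_ptb": {                     # word-level Penn Treebank
        "layers": 2, "hidden_units": 1500,
        "weight_init_uniform_range": (-0.04, 0.04),
        "learning_rate": 1.0,
        "epochs_before_decay": 14,
        "lr_decay_factor_per_epoch": 1.15,
        "gradient_norm_clip": 10,
    },
    "pmnist": {                       # permuted sequential MNIST
        "layers": 1, "hidden_units": 100,
        "epochs": 150,
        "optimizer": "RMSProp", "rmsprop_decay": 0.5,
        "learning_rate": 0.001, "gradient_norm_clip": 1,
    },
}
```

For context on the technique being assessed: zoneout stochastically preserves a hidden unit's previous activation instead of dropping it. Below is a minimal NumPy sketch of that update rule, assuming a per-unit Bernoulli mask at training time and the deterministic expectation at test time; the function and argument names are illustrative, and this is not the authors' Theano/Blocks implementation.

```python
import numpy as np

def zoneout_update(h_prev, h_new, z_prob, training=True, rng=np.random):
    """One zoneout step on a hidden-state vector.

    With probability z_prob each unit keeps its previous activation h_prev
    instead of taking the freshly computed activation h_new; at test time
    the deterministic expectation of that stochastic mix is used.
    """
    if training:
        # Bernoulli mask: 1 -> preserve the old activation, 0 -> accept the new one
        mask = (rng.random_sample(np.shape(h_prev)) < z_prob).astype(float)
        return mask * h_prev + (1.0 - mask) * h_new
    # Inference: expected value of the stochastic update
    return z_prob * h_prev + (1.0 - z_prob) * h_new

# Toy usage: zone out roughly 15% of the units of a 5-dimensional hidden state
h_prev = np.zeros(5)
h_new = np.ones(5)
print(zoneout_update(h_prev, h_new, z_prob=0.15))
```

In the paper's LSTM experiments, separate zoneout probabilities are applied to the cell state and the hidden state.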