Recurrent Batch Normalization
Authors: Tim Cooijmans, Nicolas Ballas, César Laurent, Çağlar Gülçehre, Aaron Courville
ICLR 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our proposal on various sequential problems such as sequence classification, language modeling and question answering. Our empirical results show that our batch-normalized LSTM consistently leads to faster convergence and improved generalization. |
| Researcher Affiliation | Academia | Tim Cooijmans, Nicolas Ballas, César Laurent, Çağlar Gülçehre & Aaron Courville, MILA, Université de Montréal, firstname.lastname@umontreal.ca |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement or link for the release of their source code. It mentions using Theano, Blocks, and Fuel libraries, but these are third-party tools. |
| Open Datasets | Yes | We evaluate our batch-normalized LSTM on a sequential version of the MNIST classification task (Le et al., 2015). ... We evaluate our model on the task of character-level language modeling on the Penn Treebank corpus (Marcus et al., 1993) according to the train/valid/test partition of Mikolov et al. (2012). ... We evaluate our model on a second character-level language modeling task on the much larger text8 dataset (Mahoney, 2009). ... We evaluate the models on the question answering task using the CNN corpus (Hermann et al., 2015). |
| Dataset Splits | Yes | We evaluate our model on the task of character-level language modeling on the Penn Treebank corpus (Marcus et al., 1993) according to the train/valid/test partition of Mikolov et al. (2012). ... we use the first 90M characters for training, the next 5M for validation and the final 5M characters for testing. |
| Hardware Specification | No | The paper mentions general computing support from "Calcul Québec, Compute Canada" in the acknowledgements, but does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used for the experiments. |
| Software Dependencies | No | The paper mentions using "Theano (Team et al., 2016) and the Blocks and Fuel (van Merriënboer et al., 2015) libraries for scientific computing" but does not specify version numbers for these software components. |
| Experiment Setup | Yes | Note that for all the experiments, we initialize the batch normalization scale and shift parameters γ and β to 0.1 and 0 respectively. ... The model is trained using RMSProp (Tieleman & Hinton, 2012) with learning rate of 10⁻³ and 0.9 momentum. We apply gradient clipping at 1 to avoid exploding gradients. ... For the reported performances, the first three models (LSTM, BN-LSTM and BN-everywhere) are trained using the exact same hyperparameters... We use stochastic gradient descent on minibatches of size 64, with gradient clipping at 10 and step rule determined by Adam (Kingma & Ba, 2014) with learning rate 8 × 10⁻⁵. ... Appendix D provides tables with hyperparameter values tried for different tasks, including learning rate, RMSProp momentum, hidden state size, initial γ, and batch size. |
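
The Experiment Setup row quotes the batch-normalization parameter initialization (γ = 0.1, β = 0) without showing where the normalization sits in the recurrence. The sketch below is a minimal NumPy illustration of a single BN-LSTM step in the spirit of the paper's equations. It is not the authors' Theano/Blocks code (which, per the Open Source Code row, was not released); the function and variable names are ours, and it covers only training-time minibatch statistics, omitting the per-timestep population statistics the paper uses at evaluation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize over the minibatch dimension, then scale by gamma and shift by beta.
    # Training-time behaviour only: the paper additionally maintains separate
    # per-timestep population statistics for use at test time.
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def bn_lstm_step(x_t, h_prev, c_prev, W_x, W_h, b,
                 gamma_x, gamma_h, gamma_c, beta_c):
    # Input-to-hidden and hidden-to-hidden projections are normalized separately.
    # Their shift parameters are fixed at 0 (the bias b already plays that role),
    # and every gamma is initialized to 0.1, matching the Experiment Setup row.
    gates = (batch_norm(x_t @ W_x, gamma_x, 0.0)
             + batch_norm(h_prev @ W_h, gamma_h, 0.0)
             + b)
    i, f, o, g = np.split(gates, 4, axis=1)
    c_t = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h_t = sigmoid(o) * np.tanh(batch_norm(c_t, gamma_c, beta_c))
    return h_t, c_t
```

Here `W_x` has shape `(input_dim, 4 * hidden)` and `W_h` has shape `(hidden, 4 * hidden)`; under the paper's initialization, `gamma_x` and `gamma_h` would start as `np.full(4 * hidden, 0.1)`, `gamma_c` as `np.full(hidden, 0.1)`, and `beta_c` as `np.zeros(hidden)`.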