Noisin: Unbiased Regularization for Recurrent Neural Networks

Authors: Adji Bousso Dieng, Rajesh Ranganath, Jaan Altosaar, David Blei

ICML 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On language modeling benchmarks, Noisin improves over dropout by as much as 12.2% on the Penn Treebank and 9.4% on the Wikitext-2 dataset. We also compared the state-of-the-art language model of Yang et al. (2017), both with and without Noisin. On the Penn Treebank, the method with Noisin more quickly reaches state-of-the-art performance. (Section 5: Empirical Study)
Researcher Affiliation | Academia | Columbia University, New York University, Princeton University.
Pseudocode | Yes | Algorithm 1: Noisin with multiplicative noise. (An illustrative model sketch follows the table.)
Open Source Code | No | The models were implemented in PyTorch. The source code is available upon request. (Explanation: The code is stated to be 'available upon request', which does not constitute concrete public access.)
Open Datasets | Yes | The Penn Treebank portion of the Wall Street Journal (Marcus et al., 1993) is a long standing benchmark dataset for language modeling. We use the standard split, where sections 0 to 20 (930K tokens) are used for training, sections 21 to 22 (74K tokens) for validation, and sections 23 to 24 (82K tokens) for testing (Mikolov et al., 2010). The Wikitext-2 dataset (Merity et al., 2016) has been recently introduced as an alternative to the Penn Treebank dataset.
Dataset Splits | Yes | We use the standard split, where sections 0 to 20 (930K tokens) are used for training, sections 21 to 22 (74K tokens) for validation, and sections 23 to 24 (82K tokens) for testing (Mikolov et al., 2010). (An illustrative loading sketch follows the table.)
Hardware Specification | No | We thank the Princeton Institute for Computational Science and Engineering (PICSciE), the Office of Information Technology's High Performance Computing Center and Visualization Laboratory at Princeton University for the computational resources. (Explanation: This statement acknowledges computational resources but does not specify any particular hardware, such as GPU/CPU models, processor speeds, or memory configurations.)
Software Dependencies | No | The models were implemented in PyTorch. (Explanation: The paper mentions PyTorch but does not specify a version number or list other software dependencies with their versions.)
Experiment Setup | Yes | We considered two settings in our experiments: a medium-sized network and a large network. The medium-sized network has 2 layers with 650 hidden units each. ... The large network has 2 layers with 1500 hidden units each. ... We train the models using truncated backpropagation through time... for a maximum of 200 epochs. The LSTM was unrolled for 35 steps. We used a batch size of 80 for both datasets. To avoid the problem of exploding gradients we clip the gradients to a maximum norm of 0.25. We used an initial learning rate of 30 for all experiments. This is divided by a factor of 1.2 if the perplexity on the validation set deteriorates. For the dropout-LSTM, the values used for dropout on the input, recurrent, and output layers were 0.5, 0.4, 0.5 respectively. (An illustrative training-loop sketch follows the table.)
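
The Pseudocode row points to Algorithm 1, Noisin with multiplicative noise: at each time step the hidden state is multiplied by noise whose mean is 1, so the perturbed state equals the original state in expectation. The minimal PyTorch sketch below illustrates that idea only; the class name NoisinLSTM, the Gaussian noise with variance gamma, and the single-layer LSTMCell are assumptions made here for brevity, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class NoisinLSTM(nn.Module):
    """Minimal sketch of an LSTM whose hidden state is perturbed with
    unbiased multiplicative noise at every time step, in the spirit of
    Algorithm 1. Illustrative only, not the authors' code."""

    def __init__(self, vocab_size, emb_size, hidden_size, gamma=0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        self.cell = nn.LSTMCell(emb_size, hidden_size)
        self.decoder = nn.Linear(hidden_size, vocab_size)
        self.gamma = gamma  # assumed noise-variance hyperparameter

    def forward(self, tokens, state=None):
        # tokens: (seq_len, batch) of word indices
        emb = self.embed(tokens)
        seq_len, batch, _ = emb.shape
        if state is None:
            h = emb.new_zeros(batch, self.cell.hidden_size)
            c = emb.new_zeros(batch, self.cell.hidden_size)
        else:
            h, c = state
        logits = []
        for t in range(seq_len):
            h, c = self.cell(emb[t], (h, c))
            if self.training:
                # Multiplicative noise with mean 1 keeps E[noised h] = h,
                # which is the "unbiased" property Noisin requires.
                eps = 1.0 + self.gamma ** 0.5 * torch.randn_like(h)
                h = h * eps
            logits.append(self.decoder(h))
        return torch.stack(logits), (h, c)
```

Sampling a fresh noise tensor at every time step (rather than once per sequence) matches the per-step injection described in the paper; any other distribution with unit mean would fit the same template.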
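
The Open Datasets and Dataset Splits rows describe the standard Penn Treebank split. A common way to consume it, assuming the widely used Mikolov preprocessing with files named ptb.train.txt, ptb.valid.txt, and ptb.test.txt (file names not stated in the paper), is sketched below; the `<eos>` marker and the vocabulary built from the training split are conventions assumed here.

```python
from collections import Counter
import torch


def load_split(path, vocab=None):
    """Read a whitespace-tokenized language-modeling split (one sentence
    per line) into a flat LongTensor of word ids, building the vocabulary
    from this file when none is given."""
    words = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            words.extend(line.split() + ["<eos>"])  # assumed end-of-sentence token
    if vocab is None:
        vocab = {w: i for i, (w, _) in enumerate(Counter(words).most_common())}
    # Words outside the training vocabulary are dropped; the preprocessed
    # Penn Treebank already maps rare words to <unk>, so little is lost.
    ids = torch.tensor([vocab[w] for w in words if w in vocab], dtype=torch.long)
    return ids, vocab


# Hypothetical file names from the common Mikolov preprocessing;
# the paper only states the section-level split.
train_ids, vocab = load_split("ptb.train.txt")
valid_ids, _ = load_split("ptb.valid.txt", vocab)
test_ids, _ = load_split("ptb.test.txt", vocab)
```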
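
The Experiment Setup row fixes the optimization recipe: truncated backpropagation through time over 35-step segments, batch size 80, gradient clipping at 0.25, SGD starting at learning rate 30, and division by 1.2 whenever validation perplexity deteriorates, for at most 200 epochs. The sketch below wires those quoted numbers into a generic PyTorch loop; make_batches, validate, and the (logits, state) model interface are hypothetical plumbing, not the paper's code.

```python
import torch

# Hyperparameters quoted in the Experiment Setup row (medium network;
# the large network uses 1500 hidden units instead of 650).
CLIP_NORM = 0.25
INITIAL_LR = 30.0
LR_DECAY = 1.2        # divide the lr when validation perplexity worsens
MAX_EPOCHS = 200
BPTT_STEPS = 35       # LSTM unrolled for 35 steps
BATCH_SIZE = 80       # same batch size for both datasets


def train_epoch(model, batches, criterion, lr):
    """One epoch of truncated BPTT with plain SGD and gradient clipping.
    `batches` is any iterable of (tokens, targets) minibatches shaped
    (BPTT_STEPS, BATCH_SIZE); a sketch, not the released training code."""
    model.train()
    state = None
    for tokens, targets in batches:
        if state is not None:
            # Detach to truncate backpropagation at the segment boundary.
            state = tuple(s.detach() for s in state)
        model.zero_grad()
        logits, state = model(tokens, state)
        loss = criterion(logits.view(-1, logits.size(-1)), targets.view(-1))
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP_NORM)
        with torch.no_grad():
            for p in model.parameters():
                if p.grad is not None:
                    p.add_(p.grad, alpha=-lr)   # SGD step at the quoted lr


def fit(model, make_batches, criterion, validate):
    """Outer loop: up to 200 epochs, dividing the learning rate by 1.2
    whenever validation perplexity fails to improve. `make_batches` and
    `validate` are hypothetical callables supplied by the caller."""
    lr, best_ppl = INITIAL_LR, float("inf")
    for _ in range(MAX_EPOCHS):
        train_epoch(model, make_batches(), criterion, lr)
        ppl = validate(model)
        if ppl >= best_ppl:
            lr /= LR_DECAY
        best_ppl = min(best_ppl, ppl)
    return model
```

With the NoisinLSTM sketch above and an nn.CrossEntropyLoss() criterion, fit(model, make_batches, criterion, validate) follows the schedule described in the quote, though it is not expected to reproduce the reported numbers exactly.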