Noisin: Unbiased Regularization for Recurrent Neural Networks
Authors: Adji Bousso Dieng, Rajesh Ranganath, Jaan Altosaar, David Blei
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On language modeling benchmarks, Noisin improves over dropout by as much as 12.2% on the Penn Treebank and 9.4% on the Wikitext-2 dataset. We also compared the state-of-the-art language model of Yang et al. (2017), both with and without Noisin. On the Penn Treebank, the method with Noisin more quickly reaches state-of-the-art performance. Section 5, Empirical Study. |
| Researcher Affiliation | Academia | Columbia University, New York University, Princeton University. |
| Pseudocode | Yes | Algorithm 1: Noisin with multiplicative noise. (See the hedged sketch after the table.) |
| Open Source Code | No | The models were implemented in PyTorch. The source code is available upon request. (Explanation: The code is stated to be 'available upon request', which does not constitute concrete public access.) |
| Open Datasets | Yes | The Penn Treebank portion of the Wall Street Journal (Marcus et al., 1993) is a long standing benchmark dataset for language modeling. We use the standard split, where sections 0 to 20 (930K tokens) are used for training, sections 21 to 22 (74K tokens) for validation, and sections 23 to 24 (82K tokens) for testing (Mikolov et al., 2010). The Wikitext-2 dataset (Merity et al., 2016) has been recently introduced as an alternative to the Penn Treebank dataset. |
| Dataset Splits | Yes | We use the standard split, where sections 0 to 20 (930K tokens) are used for training, sections 21 to 22 (74K tokens) for validation, and sections 23 to 24 (82K tokens) for testing (Mikolov et al., 2010). |
| Hardware Specification | No | We thank the Princeton Institute for Computational Science and Engineering (PICSciE), the Office of Information Technology's High Performance Computing Center and Visualization Laboratory at Princeton University for the computational resources. (Explanation: This statement acknowledges computational resources but does not specify any particular hardware, such as GPU/CPU models, processor speeds, or memory configurations.) |
| Software Dependencies | No | The models were implemented in PyTorch. (Explanation: The paper mentions 'PyTorch' but does not specify a version number or list other software dependencies with their versions.) |
| Experiment Setup | Yes | We considered two settings in our experiments: a medium-sized network and a large network. The medium-sized network has 2 layers with 650 hidden units each. ... The large network has 2 layers with 1500 hidden units each. ... We train the models using truncated backpropagation through time... for a maximum of 200 epochs. The LSTM was unrolled for 35 steps. We used a batch size of 80 for both datasets. To avoid the problem of exploding gradients we clip the gradients to a maximum norm of 0.25. We used an initial learning rate of 30 for all experiments. This is divided by a factor of 1.2 if the perplexity on the validation set deteriorates. For the dropout-LSTM, the values used for dropout on the input, recurrent, and output layers were 0.5, 0.4, 0.5 respectively. (See the training-configuration sketch after the table.) |
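
To make the Pseudocode row concrete, below is a minimal PyTorch sketch of an LSTM step with unbiased multiplicative noise injected into the hidden state, in the spirit of the paper's Algorithm 1. The class name `NoisinLSTMCell`, the choice of a Gamma noise distribution, and the `alpha` parameter are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn


class NoisinLSTMCell(nn.Module):
    """One LSTM step with unbiased multiplicative noise on the hidden state.

    A sketch in the spirit of Algorithm 1 (Noisin with multiplicative noise);
    the class name, the Gamma(alpha, alpha) noise choice, and `alpha` are
    assumptions for illustration only.
    """

    def __init__(self, input_size, hidden_size, alpha=10.0):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        # Gamma with concentration=alpha and rate=alpha has mean 1, so
        # E[eps * h | h] = h -- the unbiasedness condition Noisin requires.
        self.noise = torch.distributions.Gamma(alpha, alpha)

    def forward(self, x, state=None):
        h, c = self.cell(x, state)
        if self.training:  # inject noise only during training
            eps = self.noise.sample(h.shape).to(h.device)
            h = h * eps    # multiplicative, mean-one noise on the hidden state
        return h, c
```

The key property is that the injected noise has mean one, so the noised hidden state is unbiased. At evaluation time (`model.eval()`), `self.training` is False and the cell reduces to a standard LSTM step.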
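
The Experiment Setup row similarly maps onto a short training-loop sketch. The hyperparameter values (batch size 80, 35-step unrolling, gradient clipping at norm 0.25, initial learning rate 30 divided by 1.2 when validation perplexity deteriorates, at most 200 epochs) are quoted from the row above; the loop structure and the names `model`, `get_batches`, `evaluate`, `train_data`, and `valid_data` are hypothetical placeholders.

```python
import torch

# Hyperparameters quoted from the Experiment Setup row; everything else is
# an assumed scaffold, not the authors' training code.
BPTT_STEPS = 35      # LSTM unrolled for 35 steps
BATCH_SIZE = 80
CLIP_NORM = 0.25     # maximum gradient norm
MAX_EPOCHS = 200
LR_DECAY = 1.2
lr = 30.0            # initial learning rate

optimizer = torch.optim.SGD(model.parameters(), lr=lr)
best_val_ppl = float("inf")

for epoch in range(MAX_EPOCHS):
    model.train()
    hidden = None  # the (hypothetical) model initializes its state when None
    for inputs, targets in get_batches(train_data, BATCH_SIZE, BPTT_STEPS):
        optimizer.zero_grad()
        logits, hidden = model(inputs, hidden)
        hidden = tuple(h.detach() for h in hidden)  # truncated BPTT
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), targets.view(-1))
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP_NORM)
        optimizer.step()

    val_ppl = evaluate(model, valid_data)  # validation perplexity (assumed helper)
    if val_ppl > best_val_ppl:             # anneal when validation deteriorates
        lr /= LR_DECAY
        for group in optimizer.param_groups:
            group["lr"] = lr
    best_val_ppl = min(best_val_ppl, val_ppl)
```

Plain SGD with a high initial learning rate and norm-based clipping matches the quoted setup; for the dropout-LSTM baseline, dropout rates of 0.5, 0.4, and 0.5 on the input, recurrent, and output layers would be configured inside the model itself.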