Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Noisin: Unbiased Regularization for Recurrent Neural Networks
Authors: Adji Bousso Dieng, Rajesh Ranganath, Jaan Altosaar, David Blei
ICML 2018 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On language modeling benchmarks, Noisin improves over dropout by as much as 12.2% on the Penn Treebank and 9.4% on the Wikitext-2 dataset. We also compared the state-of-the-art language model of Yang et al. 2017, both with and without Noisin. On the Penn Treebank, the method with Noisin more quickly reaches state-of-the-art performance. Section 5. Empirical Study |
| Researcher Affiliation | Academia | 1Columbia University 2New York University 3Princeton University. |
| Pseudocode | Yes | Algorithm 1 Noisin with multiplicative noise. |
| Open Source Code | No | The models were implemented in PyTorch. The source code is available upon request. (Explanation: The code is stated to be 'available upon request', which does not constitute concrete public access.) |
| Open Datasets | Yes | The Penn Treebank portion of the Wall Street Journal (Marcus et al., 1993) is a long standing benchmark dataset for language modeling. We use the standard split, where sections 0 to 20 (930K tokens) are used for training, sections 21 to 22 (74K tokens) for validation, and sections 23 to 24 (82K tokens) for testing (Mikolov et al., 2010). The Wikitext-2 dataset (Merity et al., 2016) has been recently introduced as an alternative to the Penn Treebank dataset. |
| Dataset Splits | Yes | We use the standard split, where sections 0 to 20 (930K tokens) are used for training, sections 21 to 22 (74K tokens) for validation, and sections 23 to 24 (82K tokens) for testing (Mikolov et al., 2010). |
| Hardware Specification | No | We thank the Princeton Institute for Computational Science and Engineering (PICSciE), the Office of Information Technology's High Performance Computing Center and Visualization Laboratory at Princeton University for the computational resources. (Explanation: This statement mentions computational resources but does not specify any particular hardware models like GPU/CPU types, processor speeds, or memory configurations.) |
| Software Dependencies | No | The models were implemented in PyTorch. (Explanation: The paper mentions 'PyTorch' but does not specify a version number or list other software dependencies with their versions.) |
| Experiment Setup | Yes | We considered two settings in our experiments: a medium-sized network and a large network. The medium-sized network has 2 layers with 650 hidden units each. ... The large network has 2 layers with 1500 hidden units each. ... We train the models using truncated backpropagation through time... for a maximum of 200 epochs. The LSTM was unrolled for 35 steps. We used a batch size of 80 for both datasets. To avoid the problem of exploding gradients we clip the gradients to a maximum norm of 0.25. We used an initial learning rate of 30 for all experiments. This is divided by a factor of 1.2 if the perplexity on the validation set deteriorates. For the dropout-LSTM, the values used for dropout on the input, recurrent, and output layers were 0.5, 0.4, 0.5 respectively. |
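
The Experiment Setup row quotes enough hyperparameters to write them down as a configuration. The following is a minimal sketch in plain Python, not the authors' code (which is only available on request). The names `TrainConfig`, `next_learning_rate`, and `clip_by_global_norm` are hypothetical. Two points are assumptions: "deteriorates" is read here as "worse than the best validation perplexity seen so far", and gradient clipping is shown as rescaling by the global L2 norm, which is the usual meaning of clipping to a maximum norm.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters quoted in the Experiment Setup row (names are hypothetical)."""
    hidden_units: int
    num_layers: int = 2          # both networks have 2 layers
    bptt_steps: int = 35         # LSTM unrolled for 35 steps
    batch_size: int = 80         # same for both datasets
    max_epochs: int = 200
    clip_norm: float = 0.25      # maximum gradient norm
    initial_lr: float = 30.0
    lr_decay: float = 1.2        # learning rate is divided by this factor


MEDIUM = TrainConfig(hidden_units=650)
LARGE = TrainConfig(hidden_units=1500)


def next_learning_rate(lr, val_ppl, best_val_ppl, decay=1.2):
    """Divide the learning rate by `decay` if validation perplexity got worse.

    Assumption: "worse" means higher than the best value seen so far.
    """
    if val_ppl > best_val_ppl:
        return lr / decay
    return lr


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so their L2 norm is at most `max_norm`."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads)
    scale = max_norm / total
    return [g * scale for g in grads]


if __name__ == "__main__":
    lr = MEDIUM.initial_lr
    best = float("inf")
    # Made-up validation perplexities, only to exercise the schedule.
    for val_ppl in [120.0, 100.0, 105.0, 98.0]:
        lr = next_learning_rate(lr, val_ppl, best, MEDIUM.lr_decay)
        best = min(best, val_ppl)
        print(f"val_ppl={val_ppl:.1f} lr={lr:.3f}")
```

In this sketch the learning rate drops once, at the third (made-up) epoch, where the validation perplexity rises above the best value so far. A version that compares against the previous epoch instead would be an equally plausible reading of the quoted text.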