Data Noising as Smoothing in Neural Network Language Models

Authors: Ziang Xie, Sida I. Wang, Jiwei Li, Daniel Lévy, Aiming Nie, Dan Jurafsky, Andrew Y. Ng

ICLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate performance gains when applying the proposed schemes to language modeling and machine translation. Finally, we provide empirical analysis validating the relationship between noising and smoothing." (Section 4, Experiments)
Researcher Affiliation | Academia | "Computer Science Department, Stanford University. {zxie,sidaw,danilevy,anie,ang}@cs.stanford.edu, {jiweil,jurafsky}@stanford.edu"
Pseudocode | Yes | "A SKETCH OF NOISING ALGORITHM" (Appendix A, Algorithm 1: Bigram KN noising, language modeling setting)
Open Source Code | Yes | "Code will be made available at: http://deeplearning.stanford.edu/noising"
Open Datasets | Yes | "We train networks for word-level language modeling on the Penn Treebank dataset, using the standard preprocessed splits with a 10K size vocabulary (Mikolov, 2012). The PTB dataset contains 929k training tokens, 73k validation tokens, and 82k test tokens."
Dataset Splits | Yes | "The PTB dataset contains 929k training tokens, 73k validation tokens, and 82k test tokens." And, for the character-level Text8 corpus: "The first 90M characters are used for training, the next 5M for validation, and the final 5M for testing, resulting in 15.3M training tokens, 848K validation tokens, and 855K test tokens."
Hardware Specification | No | The paper states: "Some GPUs used in this work were donated by NVIDIA Corporation," but it does not specify the GPU models, CPUs, or other hardware configurations used in the experiments.
Software Dependencies | No | The paper thanks "the developers of Theano (Theano Development Team, 2016) and Tensorflow (Abadi et al., 2016)," but it gives no version numbers for either framework or for any other libraries or dependencies.
Experiment Setup | Yes | "Following Zaremba et al. (2014), we use minibatches of size 20 and unroll for 35 time steps when performing backpropagation through time. All models have two hidden layers and use LSTM units. Weights are initialized uniformly in the range [-0.1, 0.1]. We train using stochastic gradient descent with an initial learning rate of 1.0, clipping the gradient if its norm exceeds 5.0. When the validation cross entropy does not decrease after a training epoch, we halve the learning rate. We anneal the learning rate 8 times before stopping training... We set hidden unit dropout rate to 0.2 across all settings as suggested in Luong et al. (2015)."
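The "Bigram KN noising" algorithm named in the Pseudocode row can be illustrated with a short sketch. This is not the authors' released code: it assumes the paper's scheme of replacing an input token x with a draw from the unigram distribution with probability γ0 · N1+(x, ·)/c(x), where N1+(x, ·) counts the distinct bigram types beginning with x and c(x) is the unigram count; the function names are illustrative.

```python
import random
from collections import Counter, defaultdict

def bigram_stats(corpus):
    """Unigram counts c(x) and distinct-continuation counts N1+(x, .)."""
    unigram = Counter(corpus)
    followers = defaultdict(set)
    for x, y in zip(corpus, corpus[1:]):
        followers[x].add(y)
    return unigram, {x: len(s) for x, s in followers.items()}

def kn_noise(tokens, gamma0, unigram_counts, continuation_counts, seed=0):
    """With probability gamma0 * N1+(x, .) / c(x), replace token x with a
    sample from the unigram distribution; otherwise keep it unchanged."""
    rng = random.Random(seed)
    vocab, weights = zip(*unigram_counts.items())
    noised = []
    for x in tokens:
        gamma = gamma0 * continuation_counts.get(x, 0) / unigram_counts[x]
        if rng.random() < min(gamma, 1.0):
            noised.append(rng.choices(vocab, weights=weights)[0])
        else:
            noised.append(x)
    return noised
```

Tokens with many distinct continuations (high N1+(x, ·)) are noised more aggressively, mirroring how Kneser-Ney discounts n-grams whose histories are less predictive.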
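The Experiment Setup row quotes a learning-rate schedule (halve the rate when validation cross entropy stops decreasing, anneal 8 times before stopping) and a gradient-clipping rule (clip when the norm exceeds 5.0). A minimal, framework-independent sketch of that logic; `anneal_schedule` and `clip_by_norm` are hypothetical helpers, not the paper's code:

```python
def anneal_schedule(val_losses, lr0=1.0, max_anneals=8):
    """Return the learning rate used at each epoch: halve the rate whenever
    validation loss fails to decrease, stopping after max_anneals halvings."""
    lr, anneals, best, lrs = lr0, 0, float("inf"), []
    for loss in val_losses:
        lrs.append(lr)
        if loss >= best:
            anneals += 1
            if anneals > max_anneals:
                break  # annealed max_anneals times: stop training
            lr /= 2.0
        else:
            best = loss
    return lrs

def clip_by_norm(grads, max_norm=5.0):
    """Rescale a gradient vector if its L2 norm exceeds max_norm."""
    norm = sum(g * g for g in grads) ** 0.5
    if norm > max_norm:
        return [g * (max_norm / norm) for g in grads]
    return list(grads)
```

For example, `anneal_schedule([3.0, 2.5, 2.6, 2.4, 2.4])` halves the rate at the third and fifth epochs, where validation loss fails to improve.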