Data Noising as Smoothing in Neural Network Language Models

Authors: Ziang Xie, Sida I. Wang, Jiwei Li, Daniel Lévy, Aiming Nie, Dan Jurafsky, Andrew Y. Ng

ICLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate performance gains when applying the proposed schemes to language modeling and machine translation. Finally, we provide empirical analysis validating the relationship between noising and smoothing." (Section 4, Experiments)
Researcher Affiliation | Academia | "Computer Science Department, Stanford University. {zxie,sidaw,danilevy,anie,ang}@cs.stanford.edu, {jiweil,jurafsky}@stanford.edu"
Pseudocode | Yes | "A SKETCH OF NOISING ALGORITHM" (Appendix A, Algorithm 1: Bigram KN noising, language modeling setting)
Open Source Code | Yes | "Code will be made available at: http://deeplearning.stanford.edu/noising"
Open Datasets | Yes | "We train networks for word-level language modeling on the Penn Treebank dataset, using the standard preprocessed splits with a 10K size vocabulary (Mikolov, 2012). The PTB dataset contains 929k training tokens, 73k validation tokens, and 82k test tokens."
Dataset Splits | Yes | "The PTB dataset contains 929k training tokens, 73k validation tokens, and 82k test tokens." And, for the character-level Text8 corpus: "The first 90M characters are used for training, the next 5M for validation, and the final 5M for testing, resulting in 15.3M training tokens, 848K validation tokens, and 855K test tokens."
Hardware Specification | No | The paper states: "Some GPUs used in this work were donated by NVIDIA Corporation," but it does not specify the GPU models, CPUs, or other hardware configurations used in the experiments.
Software Dependencies | No | The paper thanks "the developers of Theano (Theano Development Team, 2016) and Tensorflow (Abadi et al., 2016)," but it gives no version numbers for either framework or for any other libraries or dependencies.
Experiment Setup | Yes | "Following Zaremba et al. (2014), we use minibatches of size 20 and unroll for 35 time steps when performing backpropagation through time. All models have two hidden layers and use LSTM units. Weights are initialized uniformly in the range [-0.1, 0.1]. We train using stochastic gradient descent with an initial learning rate of 1.0, clipping the gradient if its norm exceeds 5.0. When the validation cross entropy does not decrease after a training epoch, we halve the learning rate. We anneal the learning rate 8 times before stopping training... We set hidden unit dropout rate to 0.2 across all settings as suggested in Luong et al. (2015)."
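The "Bigram KN noising" algorithm named in the Pseudocode row can be illustrated with a short sketch. This is not the authors' released code: it assumes the paper's scheme of replacing an input token x with a draw from the unigram distribution with probability γ0 · N1+(x, ·)/c(x), where N1+(x, ·) counts the distinct bigram types beginning with x and c(x) is the unigram count; the function names are illustrative.

```python
import random
from collections import Counter, defaultdict

def bigram_stats(corpus):
    """Unigram counts c(x) and distinct-continuation counts N1+(x, .)."""
    unigram = Counter(corpus)
    followers = defaultdict(set)
    for x, y in zip(corpus, corpus[1:]):
        followers[x].add(y)
    return unigram, {x: len(s) for x, s in followers.items()}

def kn_noise(tokens, gamma0, unigram_counts, continuation_counts, seed=0):
    """With probability gamma0 * N1+(x, .) / c(x), replace token x with a
    sample from the unigram distribution; otherwise keep it unchanged."""
    rng = random.Random(seed)
    vocab, weights = zip(*unigram_counts.items())
    noised = []
    for x in tokens:
        gamma = gamma0 * continuation_counts.get(x, 0) / unigram_counts[x]
        if rng.random() < min(gamma, 1.0):
            noised.append(rng.choices(vocab, weights=weights)[0])
        else:
            noised.append(x)
    return noised
```

Tokens with many distinct continuations (high N1+(x, ·)) are noised more aggressively, mirroring how Kneser-Ney discounts n-grams whose histories are less predictive.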
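The Experiment Setup row quotes a learning-rate schedule (halve the rate when validation cross entropy stops decreasing, anneal 8 times before stopping) and a gradient-clipping rule (clip when the norm exceeds 5.0). A minimal, framework-independent sketch of that logic; `anneal_schedule` and `clip_by_norm` are hypothetical helpers, not the paper's code:

```python
def anneal_schedule(val_losses, lr0=1.0, max_anneals=8):
    """Return the learning rate used at each epoch: halve the rate whenever
    validation loss fails to decrease, stopping after max_anneals halvings."""
    lr, anneals, best, lrs = lr0, 0, float("inf"), []
    for loss in val_losses:
        lrs.append(lr)
        if loss >= best:
            anneals += 1
            if anneals > max_anneals:
                break  # annealed max_anneals times: stop training
            lr /= 2.0
        else:
            best = loss
    return lrs

def clip_by_norm(grads, max_norm=5.0):
    """Rescale a gradient vector if its L2 norm exceeds max_norm."""
    norm = sum(g * g for g in grads) ** 0.5
    if norm > max_norm:
        return [g * (max_norm / norm) for g in grads]
    return list(grads)
```

For example, `anneal_schedule([3.0, 2.5, 2.6, 2.4, 2.4])` halves the rate at the third and fifth epochs, where validation loss fails to improve.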