Data Noising as Smoothing in Neural Network Language Models
Authors: Ziang Xie, Sida I. Wang, Jiwei Li, Daniel Lévy, Aiming Nie, Dan Jurafsky, Andrew Y. Ng
ICLR 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate performance gains when applying the proposed schemes to language modeling and machine translation. Finally, we provide empirical analysis validating the relationship between noising and smoothing. (Section 4: Experiments) |
| Researcher Affiliation | Academia | Computer Science Department, Stanford University {zxie,sidaw,danilevy,anie,ang}@cs.stanford.edu, {jiweil,jurafsky}@stanford.edu |
| Pseudocode | Yes | Appendix A ("A Sketch of Noising Algorithm"), Algorithm 1: Bigram KN noising (language modeling setting); an illustrative sketch is given after the table. |
| Open Source Code | Yes | Code will be made available at: http://deeplearning.stanford.edu/noising |
| Open Datasets | Yes | We train networks for word-level language modeling on the Penn Treebank dataset, using the standard preprocessed splits with a 10K size vocabulary (Mikolov, 2012). The PTB dataset contains 929k training tokens, 73k validation tokens, and 82k test tokens. |
| Dataset Splits | Yes | The PTB dataset contains 929k training tokens, 73k validation tokens, and 82k test tokens. For the Text8 corpus, the first 90M characters are used for training, the next 5M for validation, and the final 5M for testing, resulting in 15.3M training tokens, 848K validation tokens, and 855K test tokens. |
| Hardware Specification | No | The paper states: "Some GPUs used in this work were donated by NVIDIA Corporation." However, it does not provide specific models or configurations for the GPUs, CPUs, or other hardware components used in the experiments. |
| Software Dependencies | No | The paper mentions: "We also thank the developers of Theano (Theano Development Team, 2016) and Tensorflow (Abadi et al., 2016)." While it names the software, it does not specify version numbers for Theano or TensorFlow, or any other libraries or dependencies. |
| Experiment Setup | Yes | Following Zaremba et al. (2014), we use minibatches of size 20 and unroll for 35 time steps when performing backpropagation through time. All models have two hidden layers and use LSTM units. Weights are initialized uniformly in the range [-0.1, 0.1]. We train using stochastic gradient descent with an initial learning rate of 1.0, clipping the gradient if its norm exceeds 5.0. When the validation cross entropy does not decrease after a training epoch, we halve the learning rate. We anneal the learning rate 8 times before stopping training... We set hidden unit dropout rate to 0.2 across all settings as suggested in Luong et al. (2015). (A training sketch using these hyperparameters follows the table.) |
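
The bigram Kneser-Ney noising scheme referenced in the Pseudocode row can be illustrated with a short Python sketch. This is a minimal illustration based on the scheme described in the paper, not the authors' released code: each token x_j is swapped out with probability γ · N1+(x_j, ·)/c(x_j) for a sample from a proposal distribution q(x) ∝ N1+(·, x), where N1+ denotes distinct-bigram counts. The helper names (`build_bigram_stats`, `kn_noise`) and the default γ = 0.2 are assumptions made here for illustration; the exact procedure is Algorithm 1 in Appendix A of the paper.

```python
import random
from collections import Counter, defaultdict

def build_bigram_stats(corpus):
    """Collect the count statistics used by the KN-inspired noising scheme.

    corpus: list of token lists (sentences).
    Returns unigram counts c(x), the distinct continuations of x (used for
    N1+(x, .)), and the distinct histories of x (used for N1+(., x)).
    """
    counts = Counter()
    continuations = defaultdict(set)   # x -> set of tokens that follow x
    histories = defaultdict(set)       # x -> set of tokens that precede x
    for sent in corpus:
        counts.update(sent)
        for prev, cur in zip(sent, sent[1:]):
            continuations[prev].add(cur)
            histories[cur].add(prev)
    return counts, continuations, histories

def kn_noise(sequence, counts, continuations, histories, gamma=0.2):
    """Noise one training sequence (hypothetical helper, not the authors' code).

    Each token x is replaced with probability gamma * N1+(x, .) / c(x) by a
    sample from q(w) proportional to N1+(., w), mirroring the discounting and
    continuation-count intuition of Kneser-Ney smoothing described in the paper.
    """
    vocab = list(counts.keys())
    # Proposal distribution over the vocabulary, q(w) ∝ N1+(., w).
    weights = [len(histories[w]) + 1e-12 for w in vocab]

    noised = []
    for x in sequence:
        swap_prob = gamma * len(continuations[x]) / max(counts[x], 1)
        if random.random() < swap_prob:
            noised.append(random.choices(vocab, weights=weights, k=1)[0])
        else:
            noised.append(x)
    return noised
```

In practice the noising would be applied on the fly, e.g. calling `kn_noise` on each training sequence every epoch, so the model sees a freshly perturbed copy of the data on each pass, consistent with the data-augmentation framing of the paper.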
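
The training recipe quoted in the Experiment Setup row maps onto a standard LSTM language model training loop. Below is a minimal sketch of those hyperparameters in PyTorch; the authors used Theano/TensorFlow, so the framework, the hidden size, the `evaluate`/batch-iterator arguments, and the exact placement of dropout are assumptions made here for illustration, while the batch size, BPTT length, initialization range, learning-rate schedule, gradient-clipping threshold, and dropout rate come from the quoted text.

```python
import torch
import torch.nn as nn

# Values quoted in the Experiment Setup row; VOCAB_SIZE matches the 10K PTB
# vocabulary, HIDDEN is a placeholder (the paper reports several model sizes).
VOCAB_SIZE, HIDDEN, LAYERS = 10_000, 512, 2
BATCH_SIZE, BPTT = 20, 35
INIT_RANGE, CLIP_NORM, DROPOUT = 0.1, 5.0, 0.2
MAX_ANNEALS = 8

class LSTMLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, HIDDEN)
        self.lstm = nn.LSTM(HIDDEN, HIDDEN, LAYERS, dropout=DROPOUT, batch_first=True)
        self.drop = nn.Dropout(DROPOUT)
        self.out = nn.Linear(HIDDEN, VOCAB_SIZE)
        # "Weights are initialized uniformly in the range [-0.1, 0.1]"
        for p in self.parameters():
            nn.init.uniform_(p, -INIT_RANGE, INIT_RANGE)

    def forward(self, tokens, state=None):
        hidden, state = self.lstm(self.drop(self.embed(tokens)), state)
        return self.out(self.drop(hidden)), state

model = LSTMLanguageModel()
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)  # initial learning rate 1.0
criterion = nn.CrossEntropyLoss()

def train_epoch(batches):
    """batches yields (inputs, targets) LongTensors of shape (BATCH_SIZE, BPTT)."""
    model.train()
    for inputs, targets in batches:
        optimizer.zero_grad()
        logits, _ = model(inputs)
        loss = criterion(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1))
        loss.backward()
        # "clipping the gradient if its norm exceeds 5.0"
        torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP_NORM)
        optimizer.step()

def fit(train_batches, val_batches, evaluate):
    """evaluate(model, batches) -> validation cross entropy (assumed helper).

    Halve the learning rate whenever validation cross entropy does not
    decrease, and stop after 8 annealings, as in the quoted setup.
    """
    best_val, anneals = float("inf"), 0
    while anneals < MAX_ANNEALS:
        train_epoch(train_batches)
        val_loss = evaluate(model, val_batches)
        if val_loss >= best_val:
            for group in optimizer.param_groups:
                group["lr"] /= 2.0
            anneals += 1
        best_val = min(best_val, val_loss)
```

Only the quoted hyperparameters are taken from the paper; anything not quoted (model width, data iterators, and where the noising from the previous sketch enters the pipeline) would need to be taken from the paper itself or the code release linked in the Open Source Code row.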