Controlling Global Statistics in Recurrent Neural Network Text Generation
Authors: Thanapon Noraset, David Demeter, Doug Downey
AAAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that the dynamic regularizer outperforms both generic training and a static regularization baseline. The approach is successful at improving word-level repetition statistics by a factor of four in RNNLMs on a definition modeling task. It also improves model perplexity when the statistical constraints are n-gram statistics taken from a large corpus. |
| Researcher Affiliation | Academia | Thanapon Noraset, David Demeter, Doug Downey Department of Electrical Engineering & Computer Science Northwestern University, Evanston IL 60208, USA {nor, ddemeter}@u.northwestern.edu, d-downey@northwestern.edu |
| Pseudocode | No | The paper describes the dynamic KL regularization algorithm verbally and with equations, but does not provide a formal pseudocode block or algorithm listing. |
| Open Source Code | Yes | The implementation is publicly available: https://github.com/northanapon/seqmodel/tree/aaai18 |
| Open Datasets | Yes | We begin by providing a comparison between baselines and our regularization on a common benchmark for language modeling, Penn Treebank (PTB). Then, we present our results on practical usage of the regularization on two other datasets for language modeling (WikiText) (Merity et al. 2016) and definition modeling (WordNet definitions) (Miller 1995). http://www.fit.vutbr.cz/~imikolov/rnnlm |
| Dataset Splits | Yes | The models are trained to maximize the likelihood of a training corpus, and evaluated on the likelihood they assign to a held-out test corpus (measured in terms of perplexity). |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for its experiments. It only mentions 'a 2-layer LSTM with 650 hidden units' and hyperparameters. |
| Software Dependencies | No | The paper does not explicitly state software dependencies with version numbers. It mentions '2-layer LSTM' and 'stochastic gradient descent', which are methods rather than specific software packages or versions. |
| Experiment Setup | Yes | For language models, we use a 2-layer LSTM with 650 hidden units. The embeddings and output logit weights are tied and have 650 units. We adopt the training hyperparameters from Zaremba et al. (2014). Specifically, we use stochastic gradient descent with the standard dropout rate of 50%. The initial learning rate is 1.0, with a constant decay rate of 0.8 starting at the 6th epoch. Validation perplexity stops significantly improving after around 20 epochs. For definition models, we use the same settings described in (Noraset et al. 2017). As for hyper-parameters for the regularization, we did a limited exploration and use the following settings throughout the experiments. The weight α on the regularization is set to be 1.0 for repetition constraints and 0.5 for bigram constraints. In addition, the log-likelihood ratio in the approximated KL-divergence (Equation 3 and 4) is clipped at -2.0 to 2.0. Finally, text is generated every 100 steps to update the model's marginals. |
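
The hyperparameters quoted in the Experiment Setup row can be collected into a single configuration. The sketch below is an illustrative summary only; the dictionary layout and key names are our own and do not mirror the released seqmodel code.

```python
# Hypothetical configuration sketch assembling the hyperparameters quoted above.
# Key names are illustrative, not the seqmodel repository's API.
config = {
    # Language model architecture
    "rnn": {
        "num_layers": 2,            # 2-layer LSTM
        "hidden_units": 650,        # 650 hidden units
        "embedding_dim": 650,       # embeddings tied with output logit weights
        "tie_weights": True,
        "dropout": 0.5,             # standard dropout rate of 50%
    },
    # Optimization (hyperparameters adopted from Zaremba et al. 2014)
    "training": {
        "optimizer": "sgd",
        "initial_lr": 1.0,
        "lr_decay": 0.8,            # constant decay rate
        "lr_decay_start_epoch": 6,  # decay starts at the 6th epoch
        "max_epochs": 20,           # validation perplexity plateaus around epoch 20
    },
    # Dynamic KL regularization
    "regularizer": {
        "alpha_repetition": 1.0,        # weight α for repetition constraints
        "alpha_bigram": 0.5,            # weight α for bigram constraints
        "log_ratio_clip": (-2.0, 2.0),  # clip range for the log-likelihood ratio
        "sample_interval": 100,         # generate text every 100 steps to update marginals
    },
}
```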
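The Pseudocode row notes that the dynamic KL regularization is described only verbally and with equations. As a rough illustration of the quantities involved (a marginal statistic estimated from sampled text and a clipped log-likelihood-ratio penalty weighted by α), here is a minimal Python sketch. The function names, the repetition window, and the exact form of the penalty are our assumptions; this does not reproduce the paper's Equations 3 and 4 or their gradient computation, for which the released code linked above is the reference.

```python
# Illustrative sketch only: estimates a word-repetition marginal from sampled text
# and forms a clipped log-likelihood-ratio penalty in the spirit of the paper's
# dynamic KL regularizer. All names and the window size are hypothetical.
import math

def repetition_marginal(sampled_sentences, window=4):
    """Fraction of tokens that repeat a word seen within the preceding `window` tokens."""
    repeats, total = 0, 0
    for tokens in sampled_sentences:
        for i, tok in enumerate(tokens):
            total += 1
            if tok in tokens[max(0, i - window):i]:
                repeats += 1
    return repeats / max(total, 1)

def clipped_kl_penalty(model_marginal, target_marginal, alpha=1.0, clip=2.0, eps=1e-8):
    """Clipped log-likelihood-ratio penalty weighted by alpha (illustrative only)."""
    log_ratio = math.log(model_marginal + eps) - math.log(target_marginal + eps)
    log_ratio = max(-clip, min(clip, log_ratio))  # clip to [-2.0, 2.0]
    return alpha * model_marginal * log_ratio

# Example usage: every 100 training steps, sample text from the model,
# re-estimate its marginal, and add the penalty to the training loss.
samples = [["the", "cat", "sat", "on", "the", "mat"],
           ["a", "very", "very", "long", "sentence"]]
p_model = repetition_marginal(samples)                # repeated-word rate in samples
penalty = clipped_kl_penalty(p_model, target_marginal=0.02, alpha=1.0)
```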