Controlling Global Statistics in Recurrent Neural Network Text Generation
Authors: Thanapon Noraset, David Demeter, Doug Downey
AAAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that the dynamic regularizer outperforms both generic training and a static regularization baseline. The approach is successful at improving word-level repetition statistics by a factor of four in RNNLMs on a definition modeling task. It also improves model perplexity when the statistical constraints are n-gram statistics taken from a large corpus. |
| Researcher Affiliation | Academia | Thanapon Noraset, David Demeter, Doug Downey Department of Electrical Engineering & Computer Science Northwestern University, Evanston IL 60208, USA {nor, ddemeter}@u.northwestern.edu, d-downey@northwestern.edu |
| Pseudocode | No | The paper describes the dynamic KL regularization algorithm verbally and with equations, but does not provide a formal pseudocode block or algorithm listing. |
| Open Source Code | Yes | The implementation is publicly available: https://github.com/northanapon/seqmodel/tree/aaai18 |
| Open Datasets | Yes | We begin by providing a comparison between baselines and our regularization on a common benchmark for language modeling, Penn Treebank (PTB). Then, we present our results on practical usage of the regularization on two other datasets for language modeling (WikiText) (Merity et al. 2016) and definition modeling (WordNet definitions) (Miller 1995). http://www.fit.vutbr.cz/~imikolov/rnnlm |
| Dataset Splits | Yes | The models are trained to maximize the likelihood of a training corpus, and evaluated on the likelihood they assign to a held-out test corpus (measured in terms of perplexity). |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for its experiments. It only mentions 'a 2-layer LSTM with 650 hidden units' and hyperparameters. |
| Software Dependencies | No | The paper does not explicitly state software dependencies with version numbers. It mentions '2-layer LSTM' and 'stochastic gradient descent', which are methods rather than specific software packages or versions. |
| Experiment Setup | Yes | For language models, we use a 2-layer LSTM with 650 hidden units. The embeddings and output logit weights are tied and have 650 units. We adopt the training hyperparameters from Zaremba et al. (2014). Specifically, we use stochastic gradient descent with the standard dropout rate of 50%. The initial learning rate is 1.0, with a constant decay rate of 0.8 starting at the 6th epoch. Validation perplexity stops significantly improving after around 20 epochs. For definition models, we use the same settings described in (Noraset et al. 2017). As for hyper-parameters for the regularization, we did a limited exploration and use the following settings throughout the experiments. The weight α on the regularization is set to be 1.0 for repetition constraints and 0.5 for bigram constraints. In addition, the log-likelihood ratio in the approximated KL-divergence (Equation 3 and 4) is clipped at -2.0 to 2.0. Finally, text is generated every 100 steps to update the model's marginals. |
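
The hyperparameters quoted in the Experiment Setup row can be collected into a single configuration. The sketch below is an illustrative summary only; the dictionary layout and key names are our own and do not mirror the released seqmodel code.

```python
# Hypothetical configuration sketch assembling the hyperparameters quoted above.
# Key names are illustrative, not the seqmodel repository's API.
config = {
    # Language model architecture
    "rnn": {
        "num_layers": 2,            # 2-layer LSTM
        "hidden_units": 650,        # 650 hidden units
        "embedding_dim": 650,       # embeddings tied with output logit weights
        "tie_weights": True,
        "dropout": 0.5,             # standard dropout rate of 50%
    },
    # Optimization (hyperparameters adopted from Zaremba et al. 2014)
    "training": {
        "optimizer": "sgd",
        "initial_lr": 1.0,
        "lr_decay": 0.8,            # constant decay rate
        "lr_decay_start_epoch": 6,  # decay starts at the 6th epoch
        "max_epochs": 20,           # validation perplexity plateaus around epoch 20
    },
    # Dynamic KL regularization
    "regularizer": {
        "alpha_repetition": 1.0,        # weight α for repetition constraints
        "alpha_bigram": 0.5,            # weight α for bigram constraints
        "log_ratio_clip": (-2.0, 2.0),  # clip range for the log-likelihood ratio
        "sample_interval": 100,         # generate text every 100 steps to update marginals
    },
}
```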
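The Pseudocode row notes that the dynamic KL regularization is described only verbally and with equations. As a rough illustration of the quantities involved (a marginal statistic estimated from sampled text and a clipped log-likelihood-ratio penalty weighted by α), here is a minimal Python sketch. The function names, the repetition window, and the exact form of the penalty are our assumptions; this does not reproduce the paper's Equations 3 and 4 or their gradient computation, for which the released code linked above is the reference.

```python
# Illustrative sketch only: estimates a word-repetition marginal from sampled text
# and forms a clipped log-likelihood-ratio penalty in the spirit of the paper's
# dynamic KL regularizer. All names and the window size are hypothetical.
import math

def repetition_marginal(sampled_sentences, window=4):
    """Fraction of tokens that repeat a word seen within the preceding `window` tokens."""
    repeats, total = 0, 0
    for tokens in sampled_sentences:
        for i, tok in enumerate(tokens):
            total += 1
            if tok in tokens[max(0, i - window):i]:
                repeats += 1
    return repeats / max(total, 1)

def clipped_kl_penalty(model_marginal, target_marginal, alpha=1.0, clip=2.0, eps=1e-8):
    """Clipped log-likelihood-ratio penalty weighted by alpha (illustrative only)."""
    log_ratio = math.log(model_marginal + eps) - math.log(target_marginal + eps)
    log_ratio = max(-clip, min(clip, log_ratio))  # clip to [-2.0, 2.0]
    return alpha * model_marginal * log_ratio

# Example usage: every 100 training steps, sample text from the model,
# re-estimate its marginal, and add the penalty to the training loss.
samples = [["the", "cat", "sat", "on", "the", "mat"],
           ["a", "very", "very", "long", "sentence"]]
p_model = repetition_marginal(samples)                # repeated-word rate in samples
penalty = clipped_kl_penalty(p_model, target_marginal=0.02, alpha=1.0)
```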