Pointer Sentinel Mixture Models

Authors: Stephen Merity, Caiming Xiong, James Bradbury, Richard Socher

ICLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Utilizing an LSTM that achieves 80.6 perplexity on the Penn Treebank, the pointer sentinel-LSTM model pushes perplexity down to 70.9 while using far fewer parameters than an LSTM that achieves similar results. In order to evaluate how well language models can exploit longer contexts and deal with more realistic vocabularies and corpora we also introduce the freely available WikiText corpus.
Researcher Affiliation | Industry | Stephen Merity, Caiming Xiong, James Bradbury & Richard Socher, MetaMind, A Salesforce Company, Palo Alto, CA, USA, {smerity,cxiong,james.bradbury,rsocher}@salesforce.com
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not include an unambiguous statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | In order to compare our model to the many recent neural language models, we conduct word-level prediction experiments on the Penn Treebank (PTB) dataset (Marcus et al., 1993), pre-processed by Mikolov et al. (2010). The dataset consists of 929k training, 73k validation, and 82k test words. (Footnote 1: Available for download at the WikiText dataset site.)
Dataset Splits | Yes | The dataset consists of 929k training, 73k validation, and 82k test words.
Hardware Specification | Yes | Attempts to run the Gal (2015) large model variant, a two layer LSTM with hidden size 1500, resulted in out of memory errors on a 12GB K80 GPU, likely due to the increased vocabulary size.
Software Dependencies | No | The paper mentions 'Moses tokenizer (Koehn et al., 2007)' but does not provide specific version numbers for key software components or libraries used for the experiments.
Experiment Setup | Yes | We increased the number of timesteps used during training from 35 to 100, matching the length of the window L. Batch size was increased to 32 from 20. We also halve the learning rate when validation perplexity is worse than the previous iteration, stopping training when validation perplexity fails to improve for three epochs or when 64 epochs are reached. The gradients are rescaled if their global norm exceeds 1 (Pascanu et al., 2013b). We evaluate the medium model configuration which features a two layer LSTM of hidden size 650. We used a value of 0.5 for both dropout connections.
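
The training schedule quoted under Experiment Setup can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code (the summary above notes that none was released). The function names, the starting learning rate of 1.0, and the reading of "fails to improve for three epochs" as "no new best validation perplexity for three consecutive epochs" are all assumptions made for the example.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients together if their joint L2 norm exceeds max_norm.

    `grads` is a list of flat lists of floats, one per parameter tensor
    (a hypothetical stand-in for real gradient tensors).
    Returns the (possibly rescaled) gradients and the original global norm.
    """
    global_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if global_norm <= max_norm:
        return grads, global_norm
    scale = max_norm / global_norm
    return [[g * scale for g in vec] for vec in grads], global_norm


def run_schedule(val_perplexities, lr=1.0, patience=3, max_epochs=64):
    """Replay the schedule over a list of per-epoch validation perplexities.

    lr=1.0 is a placeholder; the quoted setup does not state the initial rate.
    Returns (epochs_run, final_lr).
    """
    best = float("inf")   # best validation perplexity seen so far
    prev = float("inf")   # validation perplexity of the previous epoch
    epochs_without_improvement = 0
    epochs_run = 0
    for ppl in val_perplexities[:max_epochs]:  # hard cap of 64 epochs
        epochs_run += 1
        if ppl > prev:
            lr /= 2.0  # halve when worse than the previous iteration
        if ppl < best:
            best = ppl
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # assumed meaning of "fails to improve for three epochs"
        prev = ppl
    return epochs_run, lr
```

For example, replaying the made-up perplexity sequence `[100, 90, 95, 92, 93, 94]` halves the rate twice (at epochs 3 and 5) and stops after epoch 5, because epochs 3 to 5 all fail to beat the best value of 90. In a real training loop, `clip_by_global_norm` would be applied to the gradients of each batch before the parameter update.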