Pointer Sentinel Mixture Models
Authors: Stephen Merity, Caiming Xiong, James Bradbury, Richard Socher
ICLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Utilizing an LSTM that achieves 80.6 perplexity on the Penn Treebank, the pointer sentinel-LSTM model pushes perplexity down to 70.9 while using far fewer parameters than an LSTM that achieves similar results. In order to evaluate how well language models can exploit longer contexts and deal with more realistic vocabularies and corpora we also introduce the freely available WikiText corpus. |
| Researcher Affiliation | Industry | Stephen Merity, Caiming Xiong, James Bradbury & Richard Socher, MetaMind - A Salesforce Company, Palo Alto, CA, USA {smerity,cxiong,james.bradbury,rsocher}@salesforce.com |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not include an unambiguous statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | In order to compare our model to the many recent neural language models, we conduct word-level prediction experiments on the Penn Treebank (PTB) dataset (Marcus et al., 1993), pre-processed by Mikolov et al. (2010). The dataset consists of 929k training, 73k validation, and 82k test words. Available for download at the WikiText dataset site. |
| Dataset Splits | Yes | The dataset consists of 929k training, 73k validation, and 82k test words. |
| Hardware Specification | Yes | Attempts to run the Gal (2015) large model variant, a two layer LSTM with hidden size 1500, resulted in out of memory errors on a 12GB K80 GPU, likely due to the increased vocabulary size. |
| Software Dependencies | No | The paper mentions 'Moses tokenizer (Koehn et al., 2007)' but does not provide specific version numbers for key software components or libraries used for the experiments. |
| Experiment Setup | Yes | We increased the number of timesteps used during training from 35 to 100, matching the length of the window L. Batch size was increased to 32 from 20. We also halve the learning rate when validation perplexity is worse than the previous iteration, stopping training when validation perplexity fails to improve for three epochs or when 64 epochs are reached. The gradients are rescaled if their global norm exceeds 1 (Pascanu et al., 2013b). We evaluate the medium model configuration which features a two layer LSTM of hidden size 650. We used a value of 0.5 for both dropout connections. |
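
The "Experiment Setup" row quotes several concrete hyperparameters: sequence length increased from 35 to 100 (matching the window length L), batch size 32, learning-rate halving when validation perplexity regresses, early stopping after three epochs without improvement or at 64 epochs, gradient rescaling when the global norm exceeds 1, a two-layer LSTM with hidden size 650, and dropout 0.5. The PyTorch sketch below wires those reported values into a minimal training loop; the data pipeline, the optimizer choice (SGD at an assumed initial learning rate of 1.0), and the `WordLM` model class are illustrative placeholders rather than the authors' implementation, and the pointer sentinel component itself is omitted.

```python
# Hedged sketch of the training configuration quoted in the "Experiment Setup" row.
# Hidden size, sequence length, batch size, dropout, gradient clipping, and the
# LR-halving / early-stopping schedule come from the paper; everything else
# (data handling, optimizer, the plain LSTM LM) is an assumption for illustration.
import torch
import torch.nn as nn

VOCAB_SIZE = 10_000   # PTB vocabulary under the Mikolov et al. (2010) preprocessing
HIDDEN = 650          # "medium" configuration: two-layer LSTM, hidden size 650
SEQ_LEN = 100         # timesteps per training window, increased from 35 to 100 (= L)
BATCH = 32            # batch size increased from 20 to 32
DROPOUT = 0.5         # value used for both dropout connections
MAX_EPOCHS = 64       # hard cap on training epochs
PATIENCE = 3          # stop after 3 epochs without validation improvement
CLIP_NORM = 1.0       # rescale gradients if their global norm exceeds 1


class WordLM(nn.Module):
    """Plain two-layer softmax LSTM language model (pointer sentinel omitted)."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, HIDDEN)
        self.lstm = nn.LSTM(HIDDEN, HIDDEN, num_layers=2,
                            dropout=DROPOUT, batch_first=True)
        self.drop = nn.Dropout(DROPOUT)
        self.decoder = nn.Linear(HIDDEN, VOCAB_SIZE)

    def forward(self, x):
        h, _ = self.lstm(self.drop(self.embed(x)))
        return self.decoder(self.drop(h))


def train(model, train_batches, valid_batches, lr=1.0):
    # train_batches / valid_batches are assumed iterables of (input, target)
    # LongTensors of shape (BATCH, SEQ_LEN); the data pipeline is not specified here.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    best_val = prev_val = float("inf")
    epochs_since_best = 0

    for epoch in range(MAX_EPOCHS):
        model.train()
        for x, y in train_batches:
            opt.zero_grad()
            loss = loss_fn(model(x).reshape(-1, VOCAB_SIZE), y.reshape(-1))
            loss.backward()
            # Rescale gradients if their global norm exceeds 1 (Pascanu et al., 2013b).
            torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP_NORM)
            opt.step()

        # Validation perplexity = exp(mean cross-entropy).
        model.eval()
        with torch.no_grad():
            val_losses = [loss_fn(model(x).reshape(-1, VOCAB_SIZE), y.reshape(-1)).item()
                          for x, y in valid_batches]
        val_ppl = torch.tensor(val_losses).mean().exp().item()

        # Halve the learning rate when validation perplexity is worse than the previous epoch.
        if val_ppl > prev_val:
            for group in opt.param_groups:
                group["lr"] /= 2.0
        prev_val = val_ppl

        # Stop when validation perplexity fails to improve for three epochs.
        if val_ppl < best_val:
            best_val, epochs_since_best = val_ppl, 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= PATIENCE:
                break

    return best_val
```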