Improving Neural Language Models with a Continuous Cache

Authors: Edouard Grave, Armand Joulin, Nicolas Usunier

ICLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate on several language model datasets that our approach performs significantly better than recent memory augmented networks.
Researcher Affiliation | Industry | Edouard Grave, Armand Joulin, Nicolas Usunier; Facebook AI Research; {egrave,ajoulin,usunier}@fb.com
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. Model descriptions are presented in prose and mathematical equations.
Open Source Code | No | The paper does not contain any statement about releasing source code for the methodology described, nor does it provide a link to a repository.
Open Datasets | Yes | Datasets. In this section, we describe experiments performed on two small datasets: the Penn Tree Bank (Marcus et al., 1993) and the wikitext2 (Merity et al., 2016) datasets. The Penn Tree Bank dataset is made of articles from the Wall Street Journal, contains 929k training tokens and has a vocabulary size of 10k. The wikitext2 dataset is derived from Wikipedia articles, contains 2M training tokens and has a vocabulary size of 33k. [...] In this section, we describe experiments performed over two medium scale datasets: text8 and wikitext103. Both datasets are derived from Wikipedia, but different pre-processing were applied. The text8 dataset contains 17M training tokens and has a vocabulary size of 44k words, while the wikitext103 dataset has a training set of size 103M, and a vocabulary size of 267k words. [...] Finally, we report experiments carried on the lambada dataset, introduced by Paperno et al. (2016).
Dataset Splits | Yes | We report the perplexity on the validation sets in Figures 2 and 3, for various values of hyperparameters, for linear interpolation and global normalization. [...] We consider cache sizes on a logarithmic scale, from 50 to 10,000, and fit the cache hyperparameters on the validation set.
Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running its experiments.
Software Dependencies | No | The paper mentions software components like 'Adagrad algorithm' and 'adaptive softmax (Grave et al., 2016)' but does not provide specific version numbers for any libraries, frameworks, or programming languages used.
Experiment Setup | Yes | We train recurrent neural network language models with 1024 LSTM units, regularized with dropout (probability of dropping out units equals to 0.65). We use the Adagrad algorithm, with a learning rate of 0.2, a batchsize of 20 and initial weight uniformly sampled in the range [-0.05, 0.05]. We clip the norm of the gradient to 0.1 and unroll the network for 30 steps. [...] We use the same setting as in the previous section, except for the batchsize (we use 128) and dropout parameters (we use 0.45 for text8 and 0.25 for wikitext103).
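As the Pseudocode row above notes, the paper describes its continuous cache only in prose and equations. Below is a minimal Python sketch of the general idea: past hidden states are stored together with the words that followed them, a cache distribution over the vocabulary is built from dot-product similarity with the current hidden state, and that distribution is linearly interpolated with the model's softmax output. The function names and the default values of theta and lambda are illustrative choices, not quoted from the paper.

```python
import numpy as np

def cache_distribution(h_t, cache_states, cache_words, vocab_size, theta=0.3):
    """Cache distribution over the vocabulary: each stored pair (h_i, x_{i+1})
    adds exp(theta * <h_t, h_i>) of unnormalized mass to the word x_{i+1}."""
    scores = np.zeros(vocab_size)
    for h_i, w in zip(cache_states, cache_words):
        scores[w] += np.exp(theta * np.dot(h_t, h_i))
    total = scores.sum()
    return scores / total if total > 0 else scores

def interpolate(p_vocab, p_cache, lam=0.1):
    """Linear interpolation between the standard softmax LM and the cache."""
    return (1.0 - lam) * p_vocab + lam * p_cache

# Toy usage with random vectors, just to show shapes (vocab of 10, hidden size 4).
rng = np.random.default_rng(0)
h_t = rng.normal(size=4)
cache_states = [rng.normal(size=4) for _ in range(5)]
cache_words = [1, 3, 3, 7, 2]
p_cache = cache_distribution(h_t, cache_states, cache_words, vocab_size=10)
p_mix = interpolate(np.full(10, 0.1), p_cache)
```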
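The Dataset Splits row quotes the paper as fitting the cache hyperparameters on the validation set, with cache sizes on a logarithmic scale from 50 to 10,000. A small sketch of such a grid search follows; the number of grid points and the search ranges for theta and lambda are assumptions, and validation_perplexity is a hypothetical callable standing in for the full evaluation loop.

```python
import itertools
from typing import Callable, Tuple
import numpy as np

def cache_size_grid(low: int = 50, high: int = 10_000, points: int = 8) -> np.ndarray:
    """Cache sizes on a logarithmic scale from 50 to 10,000 (number of points is assumed)."""
    return np.unique(np.logspace(np.log10(low), np.log10(high), num=points).astype(int))

def fit_cache_hyperparameters(
    validation_perplexity: Callable[[int, float, float], float],
    thetas=np.linspace(0.1, 1.0, 10),   # assumed search range for theta
    lams=np.linspace(0.0, 0.5, 11),     # assumed search range for lambda
) -> Tuple[int, float, float]:
    """Grid search on the validation set; returns the (cache size, theta, lambda)
    configuration with the lowest validation perplexity."""
    return min(
        itertools.product(cache_size_grid(), thetas, lams),
        key=lambda cfg: validation_perplexity(*cfg),
    )
```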
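The Experiment Setup row lists concrete hyperparameters for the small-dataset runs: 1024 LSTM units, dropout 0.65, Adagrad with learning rate 0.2, batch size 20, uniform initialization in [-0.05, 0.05], gradient-norm clipping at 0.1, and 30-step unrolling. The paper does not name a framework, so the PyTorch sketch below is only one possible rendering of that configuration, with the embedding size assumed equal to the hidden size.

```python
import torch
import torch.nn as nn

# Hyperparameters quoted from the paper's small-dataset setup.
HIDDEN = 1024    # LSTM units
DROPOUT = 0.65   # dropout probability (PTB / wikitext2)
LR = 0.2         # Adagrad learning rate
BATCH = 20       # batch size
BPTT = 30        # unrolling length
CLIP = 0.1       # gradient-norm clipping threshold
INIT = 0.05      # uniform init range [-0.05, 0.05]
VOCAB = 10_000   # e.g. Penn Tree Bank vocabulary size

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size=VOCAB, hidden=HIDDEN, dropout=DROPOUT):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.drop = nn.Dropout(dropout)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, vocab_size)
        # Uniform initialization of all weights, as quoted in the setup.
        for p in self.parameters():
            nn.init.uniform_(p, -INIT, INIT)

    def forward(self, tokens, state=None):
        x = self.drop(self.embed(tokens))
        out, state = self.lstm(x, state)
        return self.decoder(self.drop(out)), state

model = LSTMLanguageModel()
optimizer = torch.optim.Adagrad(model.parameters(), lr=LR)
criterion = nn.CrossEntropyLoss()

def train_step(inputs, targets, state=None):
    """One truncated-BPTT step over a (BATCH, BPTT) batch of token ids."""
    optimizer.zero_grad()
    logits, state = model(inputs, state)
    loss = criterion(logits.reshape(-1, VOCAB), targets.reshape(-1))
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP)
    optimizer.step()
    # Detach the recurrent state so gradients do not flow across windows.
    return loss.item(), tuple(s.detach() for s in state)
```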