Mogrifier LSTM
Authors: Gábor Melis, Tomáš Kočiský, Phil Blunsom
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate markedly improved generalization on language modelling in the range of 3–4 perplexity points on Penn Treebank and Wikitext-2, and 0.01–0.05 bpc on four character-based datasets. We establish a new state of the art on all datasets with the exception of Enwik8, where we close a large gap between the LSTM and Transformer models. |
| Researcher Affiliation | Collaboration | Gábor Melis, Tomáš Kočiský, Phil Blunsom {melisgl,tkocisky,pblunsom}@google.com DeepMind, London, UK; University of Oxford |
| Pseudocode | No | The paper defines the LSTM update and Mogrifier equations mathematically (e.g., Equations 1 and 2, and the LSTM function). However, it does not present a clearly labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | We make the code and the tuner output available at https://github.com/deepmind/lamb. |
| Open Datasets | Yes | We compare models on both word and character-level language modelling datasets. The two word-level datasets we picked are the Penn Treebank (PTB) corpus by Marcus et al. (1993) with preprocessing from Mikolov et al. (2010) and Wikitext-2 by Merity et al. (2016)... The first character-based corpus is Enwik8 from the Hutter Prize dataset (Hutter 2012)... The final character-level dataset is the Multilingual Wikipedia Corpus (MWC, Kawakami et al. (2017))... |
| Dataset Splits | Yes | Following common practice, we use the first 90 million characters for training and the remaining 10 million evenly split between validation and test. ... For the training set, we generate 500 000 examples by uniformly sampling a given number of tokens from a vocabulary of size 1000. The validation and test sets are constructed similarly, and contain 10 000 examples. |
| Hardware Specification | No | The paper mentions general aspects like 'changing hardware performance characteristics' but does not specify any particular hardware components such as GPU models, CPU models, or cloud computing instance types used for experiments. |
| Software Dependencies | No | The paper mentions software components like 'Adam' for optimization and 'Google Vizier' for hyperparameter tuning. However, it does not provide specific version numbers for these or any other software dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | We tune hyperparameters following the experimental setup of Melis et al. (2018) using a black-box hyperparameter tuner based on batched Gaussian Process Bandits (Golovin et al. 2017). For the LSTM, the tuned hyperparameters are the same: input_embedding_ratio, learning_rate, l2_penalty, input_dropout, inter_layer_dropout, state_dropout, output_dropout. For the Mogrifier, the number of rounds r and the rank k of the low-rank approximation is also tuned (allowing for full rank, too). For word-level tasks, BPTT (Werbos et al. 1990) window size is set to 70 and batch size to 64. For character-level tasks, BPTT window size is set to 150 and batch size to 128 except for Enwik8 where the window size is 500. Input and output embeddings are tied for word-level tasks... Optimization is performed with Adam (Kingma and Ba 2014) with β1 = 0, a setting that resembles RMSProp without momentum. Gradients are clipped (Pascanu et al. 2013) to norm 10. |
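
As the Pseudocode row notes, the paper specifies the Mogrifier only as equations; the authors' actual implementation is the TensorFlow code in the linked lamb repository. The snippet below is a minimal NumPy sketch of that gating scheme, not the authors' code: `Q`, `R`, and `rounds` follow the paper's notation for the round-projection matrices and the number of rounds r, the dimensions in the toy usage are arbitrary, and the optional rank-k low-rank factorisation of Q and R is omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mogrify(x, h, Q, R, rounds=5):
    """Alternately gate the input and the previous state before the LSTM update.

    x: current input, shape (d_x,); h: previous hidden state, shape (d_h,)
    Q: list of ceil(rounds / 2) matrices of shape (d_x, d_h), used on odd rounds
    R: list of floor(rounds / 2) matrices of shape (d_h, d_x), used on even rounds
    rounds corresponds to r in the paper; rounds = 0 recovers the plain LSTM.
    """
    for i in range(1, rounds + 1):
        if i % 2 == 1:   # odd round: rescale the input with a gate computed from h
            x = 2.0 * sigmoid(Q[i // 2] @ h) * x
        else:            # even round: rescale the state with a gate computed from x
            h = 2.0 * sigmoid(R[i // 2 - 1] @ x) * h
    return x, h          # these replace x and h_prev in a standard LSTM cell


# Toy usage with random weights (dimensions chosen only for illustration).
d_x, d_h, r = 8, 16, 5
rng = np.random.default_rng(0)
Q = [rng.normal(size=(d_x, d_h)) for _ in range((r + 1) // 2)]
R = [rng.normal(size=(d_h, d_x)) for _ in range(r // 2)]
x_mog, h_mog = mogrify(rng.normal(size=d_x), rng.normal(size=d_h), Q, R, rounds=r)
```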
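The Experiment Setup row quotes two concrete optimisation choices: Adam with β1 = 0 (which the paper notes resembles RMSProp without momentum) and gradient clipping to norm 10. Below is a hedged PyTorch sketch of just those two settings; the model, learning rate, and β2 are placeholders, since the actual values were tuned per dataset and the original experiments used the authors' own TensorFlow code.

```python
import torch

# Placeholder model standing in for the Mogrifier LSTM language model.
model = torch.nn.LSTM(input_size=256, hidden_size=512, num_layers=2)

# Adam with beta1 = 0, per the quoted setup; the learning rate here is a
# placeholder, as the actual value was tuned separately for each dataset.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.0, 0.999))

def training_step(loss):
    """One optimisation step with gradients clipped to norm 10, as quoted above."""
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
    optimizer.step()
```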