Mogrifier LSTM
Authors: Gábor Melis, Tomáš Kočiský, Phil Blunsom
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate markedly improved generalization on language modelling in the range of 3–4 perplexity points on Penn Treebank and Wikitext-2, and 0.01–0.05 bpc on four character-based datasets. We establish a new state of the art on all datasets with the exception of Enwik8, where we close a large gap between the LSTM and Transformer models. |
| Researcher Affiliation | Collaboration | Gábor Melis, Tomáš Kočiský, Phil Blunsom {melisgl,tkocisky,pblunsom}@google.com DeepMind, London, UK; University of Oxford |
| Pseudocode | No | The paper defines the LSTM update and Mogrifier equations mathematically (e.g., Equations 1 and 2, and the LSTM function). However, it does not present a clearly labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | We make the code and the tuner output available at https://github.com/deepmind/lamb. |
| Open Datasets | Yes | We compare models on both word and character-level language modelling datasets. The two word-level datasets we picked are the Penn Treebank (PTB) corpus by Marcus et al. (1993) with preprocessing from Mikolov et al. (2010) and Wikitext-2 by Merity et al. (2016)... The first character-based corpus is Enwik8 from the Hutter Prize dataset (Hutter 2012)... The final character-level dataset is the Multilingual Wikipedia Corpus (MWC, Kawakami et al. (2017))... |
| Dataset Splits | Yes | Following common practice, we use the first 90 million characters for training and the remaining 10 million evenly split between validation and test. ... For the training set, we generate 500 000 examples by uniformly sampling a given number of tokens from a vocabulary of size 1000. The validation and test sets are constructed similarly, and contain 10 000 examples. |
| Hardware Specification | No | The paper mentions general aspects like 'changing hardware performance characteristics' but does not specify any particular hardware components such as GPU models, CPU models, or cloud computing instance types used for experiments. |
| Software Dependencies | No | The paper mentions software components like 'Adam' for optimization and 'Google Vizier' for hyperparameter tuning. However, it does not provide specific version numbers for these or any other software dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | We tune hyperparameters following the experimental setup of Melis et al. (2018) using a black-box hyperparameter tuner based on batched Gaussian Process Bandits (Golovin et al. 2017). For the LSTM, the tuned hyperparameters are the same: input_embedding_ratio, learning_rate, l2_penalty, input_dropout, inter_layer_dropout, state_dropout, output_dropout. For the Mogrifier, the number of rounds r and the rank k of the low-rank approximation is also tuned (allowing for full rank, too). For word-level tasks, BPTT (Werbos et al. 1990) window size is set to 70 and batch size to 64. For character-level tasks, BPTT window size is set to 150 and batch size to 128 except for Enwik8 where the window size is 500. Input and output embeddings are tied for word-level tasks... Optimization is performed with Adam (Kingma and Ba 2014) with β1 = 0, a setting that resembles RMSProp without momentum. Gradients are clipped (Pascanu et al. 2013) to norm 10. |
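
As the Pseudocode row notes, the paper specifies the Mogrifier only as equations; the authors' actual implementation is the TensorFlow code in the linked lamb repository. The snippet below is a minimal NumPy sketch of that gating scheme, not the authors' code: `Q`, `R`, and `rounds` follow the paper's notation for the round-projection matrices and the number of rounds r, the dimensions in the toy usage are arbitrary, and the optional rank-k low-rank factorisation of Q and R is omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mogrify(x, h, Q, R, rounds=5):
    """Alternately gate the input and the previous state before the LSTM update.

    x: current input, shape (d_x,); h: previous hidden state, shape (d_h,)
    Q: list of ceil(rounds / 2) matrices of shape (d_x, d_h), used on odd rounds
    R: list of floor(rounds / 2) matrices of shape (d_h, d_x), used on even rounds
    rounds corresponds to r in the paper; rounds = 0 recovers the plain LSTM.
    """
    for i in range(1, rounds + 1):
        if i % 2 == 1:   # odd round: rescale the input with a gate computed from h
            x = 2.0 * sigmoid(Q[i // 2] @ h) * x
        else:            # even round: rescale the state with a gate computed from x
            h = 2.0 * sigmoid(R[i // 2 - 1] @ x) * h
    return x, h          # these replace x and h_prev in a standard LSTM cell


# Toy usage with random weights (dimensions chosen only for illustration).
d_x, d_h, r = 8, 16, 5
rng = np.random.default_rng(0)
Q = [rng.normal(size=(d_x, d_h)) for _ in range((r + 1) // 2)]
R = [rng.normal(size=(d_h, d_x)) for _ in range(r // 2)]
x_mog, h_mog = mogrify(rng.normal(size=d_x), rng.normal(size=d_h), Q, R, rounds=r)
```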
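The Experiment Setup row quotes two concrete optimisation choices: Adam with β1 = 0 (which the paper notes resembles RMSProp without momentum) and gradient clipping to norm 10. Below is a hedged PyTorch sketch of just those two settings; the model, learning rate, and β2 are placeholders, since the actual values were tuned per dataset and the original experiments used the authors' own TensorFlow code.

```python
import torch

# Placeholder model standing in for the Mogrifier LSTM language model.
model = torch.nn.LSTM(input_size=256, hidden_size=512, num_layers=2)

# Adam with beta1 = 0, per the quoted setup; the learning rate here is a
# placeholder, as the actual value was tuned separately for each dataset.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.0, 0.999))

def training_step(loss):
    """One optimisation step with gradients clipped to norm 10, as quoted above."""
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
    optimizer.step()
```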