On the State of the Art of Evaluation in Neural Language Models

Authors: Gábor Melis, Chris Dyer, Phil Blunsom

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We reevaluate several popular architectures and regularisation methods with large-scale automatic black-box hyperparameter tuning and arrive at the somewhat surprising conclusion that standard LSTM architectures, when properly regularised, outperform more recent models. We establish a new state of the art on the Penn Treebank and Wikitext-2 corpora, as well as strong baselines on the Hutter Prize dataset."
Researcher Affiliation | Collaboration | "Gábor Melis, Chris Dyer, Phil Blunsom ({melisgl,cdyer,pblunsom}@google.com), DeepMind; University of Oxford"
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described.
Open Datasets | Yes | "We compare models on three datasets. The smallest of them is the Penn Treebank corpus by Marcus et al. (1993) with preprocessing from Mikolov et al. (2010). We also include another word level corpus: Wikitext-2 by Merity et al. (2016). It is about twice the size of Penn Treebank with a larger vocabulary and much lighter preprocessing. The third corpus is Enwik8 from the Hutter Prize dataset (Hutter, 2012)."
Dataset Splits | Yes | "For Enwik8... we use the first 90 million characters for training, and the remaining 10 million evenly split between validation and test. For evaluation, the checkpoint with the best validation perplexity found by the tuner is loaded and the model is applied to the test set..." (a split sketch follows the table)
Hardware Specification | No | The paper mentions running experiments 'with a single GPU' but does not specify the exact model or other hardware details such as CPU or memory.
Software Dependencies | No | The paper mentions using TensorFlow but does not provide specific version numbers for it or any other software dependencies.
Experiment Setup | Yes | "When training word level models we follow common practice and use a batch size of 64, truncated backpropagation with 35 time steps... Optimisation is performed by Adam (Kingma & Ba, 2014) with β1 = 0 but otherwise default parameters (β2 = 0.999, ϵ = 10⁻⁹)... For character level models... truncated backpropagation is performed with 50 time steps. Adam's parameters are β2 = 0.99, ϵ = 10⁻⁵. Batch size is 128... Hyperparameters are optimised by Google Vizier... we restrict the set of hyperparameters to learning rate, input embedding ratio, input dropout, state dropout, output dropout, weight decay." (a configuration sketch follows the table)
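
To make the Enwik8 split quoted in the Dataset Splits row concrete, here is a minimal sketch. The file name "enwik8" and the helper name load_enwik8_splits are illustrative assumptions, not taken from the paper.

```python
# Sketch of the Enwik8 split quoted above: the first 90 million characters
# are used for training and the remaining 10 million are split evenly
# between validation and test. The file name is an assumption.

def load_enwik8_splits(path="enwik8"):
    with open(path, "rb") as f:
        data = f.read()
    train = data[:90_000_000]            # first 90M characters
    valid = data[90_000_000:95_000_000]  # next 5M characters
    test = data[95_000_000:]             # final 5M characters
    return train, valid, test
```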
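
The fixed training settings quoted in the Experiment Setup row can be summarised in a small configuration sketch. The dictionary layout and names below are illustrative assumptions rather than the paper's actual code; only the numeric values come from the quoted text.

```python
# Fixed training settings quoted from the paper; the dictionary structure
# itself is an illustrative assumption.

WORD_LEVEL = {
    "batch_size": 64,
    "bptt_steps": 35,  # truncated backpropagation length
    "adam": {"beta1": 0.0, "beta2": 0.999, "epsilon": 1e-9},
}

CHAR_LEVEL = {
    "batch_size": 128,
    "bptt_steps": 50,
    # beta1 is not restated in the quoted text for character level models.
    "adam": {"beta2": 0.99, "epsilon": 1e-5},
}

# Hyperparameters searched by the black-box tuner (Google Vizier):
TUNED_HYPERPARAMETERS = [
    "learning_rate",
    "input_embedding_ratio",
    "input_dropout",
    "state_dropout",
    "output_dropout",
    "weight_decay",
]
```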