Dynamic Evaluation of Neural Sequence Models

Authors: Ben Krause, Emmanuel Kahembwe, Iain Murray, Steve Renals

ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We apply dynamic evaluation to outperform all previous word-level perplexities on the Penn Treebank and WikiText-2 datasets (achieving 51.1 and 44.3 respectively) and all previous character-level cross-entropies on the text8 and Hutter Prize datasets (achieving 1.19 bits/char and 1.08 bits/char respectively)." and "7. Experiments: We applied dynamic evaluation to word- and character-level language modelling."
Researcher Affiliation | Academia | "Ben Krause, Emmanuel Kahembwe, Iain Murray, Steve Renals. School of Informatics, University of Edinburgh. Correspondence to: Ben Krause <ben.krause@ed.ac.uk>."
Pseudocode | No | The paper describes the methodology in text but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "code available at https://github.com/benkrause/dynamic-evaluation"
Open Datasets | Yes | "We performed word-level language modelling experiments on the Penn Treebank (PTB, Marcus et al., 1993) and WikiText-2 (Merity et al., 2017) datasets." and "The Hutter Prize dataset (Hutter, 2006) is comprised of Wikipedia text, including XML and characters from non-Latin languages." and "The text8 dataset is derived from the Hutter Prize dataset."
Dataset Splits | Yes | "After training the base model, we tune hyper-parameters for dynamic evaluation on the validation set, and evaluate both the static and dynamic versions of the model on the test set." and "We use the same test set as in Mikolov et al. (2014), but also hold out the final 100k training tokens as a validation set to allow for fair hyper-parameter tuning." and "We used a 90:5:5 split for training, validation, and testing."
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU or CPU models) used to run its experiments.
Software Dependencies | No | The paper mentions software components and models such as AWD-LSTM, mLSTM, RMSprop, and Adam, but does not give version numbers for them or for underlying libraries such as Python or PyTorch/TensorFlow.
Experiment Setup | No | The paper describes the methodology and hyper-parameter tuning process, including the use of sequence segments of length 5 for word-level tasks and 20 for character-level tasks. It mentions tuning the learning rate and decay parameter, but does not provide specific values for these hyper-parameters or other training configurations.
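
The paper's core technique, dynamic evaluation, adapts a trained language model at test time: the test stream is scored segment by segment, and after each segment the parameters are updated by gradient descent, with a decay term pulling them back toward the trained ("global") parameters. Below is a minimal PyTorch-style sketch of the simplest SGD variant under the setup quoted above (segments of length 5 for word-level models; learning rate and decay rate tuned on the validation set). The function name, the model interface (`model(inputs, hidden)` returning logits and an LSTM-style state), and the default hyper-parameter values are illustrative assumptions, not the released implementation; the paper's strongest results use an RMS-scaled update rather than plain SGD.

```python
# Hedged sketch of dynamic evaluation: score each test segment, then adapt
# the parameters on it, decaying back toward the trained global parameters.
import torch
import torch.nn.functional as F

def dynamic_eval(model, data, seg_len=5, eta=1e-4, lam=1e-3):
    """Return the average per-token test loss while adapting `model` in place.

    data: 1-D LongTensor of token ids.
    seg_len: 5 for word-level, 20 for character-level (per the paper).
    eta, lam: learning rate and decay rate, tuned on the validation set
        in the paper; the defaults here are placeholders.
    """
    # Snapshot the trained ("global") parameters that the decay pulls toward.
    global_params = [p.detach().clone() for p in model.parameters()]
    total_loss, total_tokens = 0.0, 0
    hidden = None  # assumes the model accepts None as the initial state

    for start in range(0, data.size(0) - 1, seg_len):
        end = min(start + seg_len, data.size(0) - 1)
        inputs = data[start:end].unsqueeze(0)        # (1, L)
        targets = data[start + 1:end + 1].unsqueeze(0)

        logits, hidden = model(inputs, hidden)       # logits: (1, L, vocab)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1))

        # Record the loss BEFORE adapting, so the model is always scored
        # on tokens it has not yet seen.
        total_loss += loss.item() * targets.numel()
        total_tokens += targets.numel()

        # Adapt on the segment just scored.
        model.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p, g in zip(model.parameters(), global_params):
                if p.grad is not None:
                    p -= eta * p.grad        # SGD step on the segment loss
                p += lam * (g - p)           # decay toward global parameters

        # Truncate backprop between segments (assumes an LSTM (h, c) state).
        hidden = tuple(h.detach() for h in hidden)

    return total_loss / total_tokens
```

Because each segment is scored before the update, `exp()` of the returned loss is directly comparable to a static test perplexity; the only difference from static evaluation is the adaptation step between segments.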