Dynamic Evaluation of Neural Sequence Models
Authors: Ben Krause, Emmanuel Kahembwe, Iain Murray, Steve Renals
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We apply dynamic evaluation to outperform all previous word-level perplexities on the Penn Treebank and WikiText-2 datasets (achieving 51.1 and 44.3 respectively) and all previous character-level cross-entropies on the text8 and Hutter Prize datasets (achieving 1.19 bits/char and 1.08 bits/char respectively)." and "7. Experiments: We applied dynamic evaluation to word- and character-level language modelling" |
| Researcher Affiliation | Academia | Ben Krause, Emmanuel Kahembwe, Iain Murray, Steve Renals; School of Informatics, University of Edinburgh. Correspondence to: Ben Krause <ben.krause@ed.ac.uk>. |
| Pseudocode | No | The paper describes the methodology in text but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | "Code available at https://github.com/benkrause/dynamic-evaluation" |
| Open Datasets | Yes | "We performed word-level language modelling experiments on the Penn Treebank (PTB, Marcus et al., 1993) and WikiText-2 (Merity et al., 2017) datasets." and "The Hutter Prize dataset (Hutter, 2006) is comprised of Wikipedia text, including XML and characters from non-Latin languages." and "The text8 dataset is derived from the Hutter Prize dataset" |
| Dataset Splits | Yes | "After training the base model, we tune hyper-parameters for dynamic evaluation on the validation set, and evaluate both the static and dynamic versions of the model on the test set." and "We use the same test set as in Mikolov et al. (2014), but also hold out the final 100k training tokens as a validation set to allow for fair hyper-parameter tuning" and "We used a 90:5:5 split for training, validation, and testing." |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types) used for running its experiments. |
| Software Dependencies | No | The paper mentions software components and models such as AWD-LSTM, mLSTM, RMSprop, and Adam, but does not specify their version numbers or the versions of underlying libraries such as Python or PyTorch/TensorFlow. |
| Experiment Setup | No | The paper describes the methodology and hyper-parameter tuning process, including the use of sequence segments of length 5 for word-level tasks and 20 for character-level tasks. It mentions tuning the learning rate and decay parameters but does not provide specific values for these hyper-parameters or other training configurations (an illustrative sketch of the dynamic evaluation loop follows this table). |
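
For readers unfamiliar with the method being assessed, the sketch below illustrates the core dynamic evaluation loop: score each test segment with the current weights, then take a gradient step on that segment's loss and decay the weights back toward the static parameters. This is a simplified plain-SGD variant, not the authors' released code (the paper's full update is RMSprop-style); `dynamic_eval`, the default segment length, learning rate, and decay value, and the assumption that `model` is a PyTorch LSTM language model returning `(logits, hidden)` and accepting `hidden=None` are all illustrative.

```python
import torch
import torch.nn.functional as F

def dynamic_eval(model, tokens, seg_len=5, lr=1e-4, decay=1e-3):
    """Evaluate a 1-D LongTensor of token ids while adapting the model online.

    Each segment is scored *before* the update, so the reported loss is an
    honest held-out measurement; the decay term pulls the weights back toward
    the static parameters so the model tracks local statistics without drifting.
    """
    model.eval()  # dropout off; gradients still flow in eval mode
    theta0 = [p.detach().clone() for p in model.parameters()]  # static weights
    hidden = None  # assumed: model initialises its own state when given None
    total_nll, total_tokens = 0.0, 0

    for i in range(0, tokens.size(0) - 1, seg_len):
        tgt = tokens[i + 1 : i + 1 + seg_len]
        inp = tokens[i : i + tgt.size(0)].unsqueeze(1)   # (seg, batch=1)

        logits, hidden = model(inp, hidden)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), tgt)
        total_nll += loss.item() * tgt.size(0)
        total_tokens += tgt.size(0)

        model.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p, p0 in zip(model.parameters(), theta0):
                if p.grad is not None:
                    p -= lr * p.grad        # gradient step on this segment
                p += decay * (p0 - p)       # decay toward static weights
        hidden = tuple(h.detach() for h in hidden)  # assumed LSTM (h, c) state

    return total_nll / total_tokens  # average NLL in nats per token
```

Static evaluation is the same loop with the gradient step and decay removed; word-level perplexity is `exp` of the returned value, and character-level bits/char is the returned value divided by `ln 2`.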