Mind the Gap: Assessing Temporal Generalization in Neural Language Models

Authors: Angeliki Lazaridou, Adhiguna Kuncoro, Elena Gribovskaya, Devang Agrawal, Adam Liška, Tayfun Terzi, Mai Gimenez, Cyprien de Masson d'Autume, Tomas Kocisky, Sebastian Ruder, Dani Yogatama, Kris Cao, Susannah Young, Phil Blunsom

NeurIPS 2021

Each entry below lists a reproducibility variable, the assessed result, and the LLM response supporting that assessment.
Research Type: Experimental. To what extent does the current static language modelling practice overestimate performance, compared to the more realistic setup that evaluates LMs on future utterances? To this end, we introduce our dynamic, streaming language modelling benchmarks (§2), and find that Transformer-XLs (Dai et al., 2019) perform up to 16% worse when predicting articles that are published up to 2 years after the end of the training period. Moreover, model performance becomes increasingly worse with time (§3). We perform our experiments on autoregressive, left-to-right LMs.
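The "up to 16% worse" figure is a relative perplexity comparison between models evaluated on the same future articles. A minimal sketch of that arithmetic, assuming average per-token negative log-likelihoods are available for both setups (the numbers below are placeholders chosen only so the gap comes out near 16%; they are not results from the paper):

```python
import math

# Hypothetical average per-token negative log-likelihoods (nats) on the
# 2018-2019 test documents; values are placeholders, not paper results.
nll_control = 3.20          # model whose training data covers the test period
nll_time_stratified = 3.35  # model trained only on pre-2018 data

ppl_control = math.exp(nll_control)
ppl_time_stratified = math.exp(nll_time_stratified)

# Relative perplexity increase of the TIME-STRATIFIED model over CONTROL.
relative_increase = (ppl_time_stratified - ppl_control) / ppl_control
print(f"relative perplexity increase: {relative_increase:.1%}")  # ~16.2% here
```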
Researcher Affiliation: Industry. Angeliki Lazaridou, Adhiguna Kuncoro, Elena Gribovskaya, Devang Agrawal, Adam Liška, Tayfun Terzi, Mai Gimenez, Cyprien de Masson d'Autume, Tomas Kocisky, Sebastian Ruder, Dani Yogatama, Kris Cao, Susannah Young, Phil Blunsom. DeepMind, London, UK. {angeliki,akuncoro,egribovskaya}@deepmind.com
Pseudocode: No. The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code: No. While we do not release the code, the experiment can be repeated with publicly available Transformer implementations.
Open Datasets: Yes. For the scientific domain, we use the publicly available arXiv abstracts (ARXIV). For news, we use the publicly available WMT News Crawl (WMT). (Footnote: arXiv: https://arxiv.org/help/oa/index; WMT News: http://data.statmt.org/news-crawl; and SacreMoses: https://github.com/alvations/sacremoses.)
Dataset Splits: Yes. Here we use all documents from the beginning of each dataset's time period up until September 2017 as training data, and use the last three months of 2017 as our validation period; we denote this as the TIME-STRATIFIED setup. We sample a similarly-sized validation set as the TIME-STRATIFIED setup, which in this case comes from the 2018-2019 evaluation period (again excluding the test documents).
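A minimal sketch of the TIME-STRATIFIED split described above, assuming each document carries a publication date; the helper and its cut-off handling are illustrative, since the paper does not release code:

```python
from datetime import date

def time_stratified_split(documents):
    """Split (text, publication_date) pairs into TIME-STRATIFIED sets.

    Training: everything up to and including September 2017.
    Validation: October-December 2017.
    Documents from 2018 onwards are held out as the future evaluation period.
    """
    train, validation, future_eval = [], [], []
    for text, pub_date in documents:
        if pub_date <= date(2017, 9, 30):
            train.append(text)
        elif pub_date <= date(2017, 12, 31):
            validation.append(text)
        else:
            future_eval.append(text)
    return train, validation, future_eval

# Usage with toy documents (dates are placeholders).
docs = [("older article", date(2016, 5, 1)),
        ("late-2017 article", date(2017, 11, 12)),
        ("future article", date(2019, 3, 2))]
train, validation, future_eval = time_stratified_split(docs)
```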
Hardware Specification: Yes. To train and evaluate the models, including hyperparameter optimization, we used approximately 186,000 TPU hours. In each experiment, we used 32 TPUs for training and 1 TPU for evaluation.
Software Dependencies: No. The paper mentions 'SentencePiece' and 'Moses' for tokenization, but does not provide specific version numbers for these or any other software dependencies.
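A hedged sketch of the tokenization pipeline these dependencies suggest, using the sentencepiece and sacremoses Python packages (versions unspecified in the paper); the corpus path, the ordering of the two tokenizers, and the training options are assumptions, and the vocabulary size of 50,259 mirrors the experiment-setup entry below:

```python
import sentencepiece as spm
from sacremoses import MosesTokenizer

# Train a subword vocabulary on a plain-text corpus sample (path is a placeholder).
spm.SentencePieceTrainer.train(
    input="training_sample.txt",
    model_prefix="lm_vocab",
    vocab_size=50259,
)

sp = spm.SentencePieceProcessor(model_file="lm_vocab.model")
moses = MosesTokenizer(lang="en")

text = "Language models struggle to predict articles published after training."
# Moses-style word tokenization followed by subword encoding.
words = moses.tokenize(text, return_str=True)
subword_ids = sp.encode(words, out_type=int)
```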
Experiment Setup: Yes. We use a Transformer-XL (Dai et al., 2019) with 18 layers and 1,024 hidden units, resulting in 287M parameters, roughly 15% smaller than GPT-2 Medium and BERT Large; we later explore larger models in §4. We set the Transformer sequence length to 1,024, and set the memory cache length to 384 during training and 1,600 during test. We use a vocabulary of 50,259 subwords, obtained via SentencePiece (Kudo and Richardson, 2018) trained on a random subset (up to 15GB) of the training data of each respective experiment, i.e., CONTROL and TIME-STRATIFIED.
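A configuration sketch collecting the hyperparameters quoted above, assuming a generic Transformer-XL implementation; the field names are illustrative, since the authors do not release their code:

```python
from dataclasses import dataclass

@dataclass
class TransformerXLConfig:
    """Hyperparameters from the experiment-setup entry; field names are assumptions."""
    num_layers: int = 18
    hidden_size: int = 1024
    vocab_size: int = 50_259        # SentencePiece subwords
    sequence_length: int = 1024
    train_memory_length: int = 384  # Transformer-XL memory cache during training
    eval_memory_length: int = 1600  # longer memory cache at test time
    # Reported total: roughly 287M parameters, ~15% smaller than GPT-2 Medium / BERT Large.

config = TransformerXLConfig()
```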