Mind the Gap: Assessing Temporal Generalization in Neural Language Models

Authors: Angeliki Lazaridou, Adhiguna Kuncoro, Elena Gribovskaya, Devang Agrawal, Adam Liška, Tayfun Terzi, Mai Gimenez, Cyprien de Masson d'Autume, Tomas Kocisky, Sebastian Ruder, Dani Yogatama, Kris Cao, Susannah Young, Phil Blunsom

NeurIPS 2021

Each entry below lists a reproducibility variable, the assessed result, and the LLM response supporting that assessment.
Research Type: Experimental. To what extent does the current static language modelling practice overestimate performance, compared to the more realistic setup that evaluates LMs on future utterances? To this end, we introduce our dynamic, streaming language modelling benchmarks (§2), and find that Transformer-XLs (Dai et al., 2019) perform up to 16% worse when predicting articles that are published up to 2 years after the end of the training period. Moreover, model performance becomes increasingly worse with time (§3). We perform our experiments on autoregressive, left-to-right LMs.
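The "up to 16% worse" figure is a relative perplexity comparison between models evaluated on the same future articles. A minimal sketch of that arithmetic, assuming average per-token negative log-likelihoods are available for both setups (the numbers below are placeholders chosen only so the gap comes out near 16%; they are not results from the paper):

```python
import math

# Hypothetical average per-token negative log-likelihoods (nats) on the
# 2018-2019 test documents; values are placeholders, not paper results.
nll_control = 3.20          # model whose training data covers the test period
nll_time_stratified = 3.35  # model trained only on pre-2018 data

ppl_control = math.exp(nll_control)
ppl_time_stratified = math.exp(nll_time_stratified)

# Relative perplexity increase of the TIME-STRATIFIED model over CONTROL.
relative_increase = (ppl_time_stratified - ppl_control) / ppl_control
print(f"relative perplexity increase: {relative_increase:.1%}")  # ~16.2% here
```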
Researcher Affiliation: Industry. Angeliki Lazaridou, Adhiguna Kuncoro, Elena Gribovskaya, Devang Agrawal, Adam Liška, Tayfun Terzi, Mai Gimenez, Cyprien de Masson d'Autume, Tomas Kocisky, Sebastian Ruder, Dani Yogatama, Kris Cao, Susannah Young, Phil Blunsom. DeepMind, London, UK. {angeliki,akuncoro,egribovskaya}@deepmind.com
Pseudocode: No. The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code: No. While we do not release the code, the experiment can be repeated with publicly available Transformer implementations.
Open Datasets: Yes. For the scientific domain, we use the publicly available arXiv abstracts (ARXIV). For news, we use the publicly available WMT News Crawl (WMT). (Footnote: arXiv: https://arxiv.org/help/oa/index; WMT News: http://data.statmt.org/news-crawl; and SacreMoses: https://github.com/alvations/sacremoses.)
Dataset Splits: Yes. Here we use all documents from the beginning of each dataset's time period up until September 2017 as training data, and use the last three months of 2017 as our validation period; we denote this as the TIME-STRATIFIED setup. We sample a similarly-sized validation set as the TIME-STRATIFIED setup, which in this case comes from the 2018-2019 evaluation period (again excluding the test documents).
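A minimal sketch of the TIME-STRATIFIED split described above, assuming each document carries a publication date; the helper and its cut-off handling are illustrative, since the paper does not release code:

```python
from datetime import date

def time_stratified_split(documents):
    """Split (text, publication_date) pairs into TIME-STRATIFIED sets.

    Training: everything up to and including September 2017.
    Validation: October-December 2017.
    Documents from 2018 onwards are held out as the future evaluation period.
    """
    train, validation, future_eval = [], [], []
    for text, pub_date in documents:
        if pub_date <= date(2017, 9, 30):
            train.append(text)
        elif pub_date <= date(2017, 12, 31):
            validation.append(text)
        else:
            future_eval.append(text)
    return train, validation, future_eval

# Usage with toy documents (dates are placeholders).
docs = [("older article", date(2016, 5, 1)),
        ("late-2017 article", date(2017, 11, 12)),
        ("future article", date(2019, 3, 2))]
train, validation, future_eval = time_stratified_split(docs)
```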
Hardware Specification: Yes. To train and evaluate the models, including hyperparameter optimization, we used approximately 186,000 TPU hours. In each experiment, we used 32 TPUs for training and 1 TPU for evaluation.
Software Dependencies: No. The paper mentions 'SentencePiece' and 'Moses' for tokenization, but does not provide specific version numbers for these or any other software dependencies.
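A hedged sketch of the tokenization pipeline these dependencies suggest, using the sentencepiece and sacremoses Python packages (versions unspecified in the paper); the corpus path, the ordering of the two tokenizers, and the training options are assumptions, and the vocabulary size of 50,259 mirrors the experiment-setup entry below:

```python
import sentencepiece as spm
from sacremoses import MosesTokenizer

# Train a subword vocabulary on a plain-text corpus sample (path is a placeholder).
spm.SentencePieceTrainer.train(
    input="training_sample.txt",
    model_prefix="lm_vocab",
    vocab_size=50259,
)

sp = spm.SentencePieceProcessor(model_file="lm_vocab.model")
moses = MosesTokenizer(lang="en")

text = "Language models struggle to predict articles published after training."
# Moses-style word tokenization followed by subword encoding.
words = moses.tokenize(text, return_str=True)
subword_ids = sp.encode(words, out_type=int)
```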
Experiment Setup: Yes. We use a Transformer-XL (Dai et al., 2019) with 18 layers and 1,024 hidden units, resulting in 287M parameters, roughly 15% smaller than GPT-2 Medium and BERT Large; we later explore larger models in §4. We set the Transformer sequence length to 1,024, and set the memory cache length to 384 during training and 1,600 during test. We use a vocabulary of 50,259 subwords, obtained via SentencePiece (Kudo and Richardson, 2018) trained on a random subset (up to 15GB) of the training data of each respective experiment, i.e., CONTROL and TIME-STRATIFIED.
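A configuration sketch collecting the hyperparameters quoted above, assuming a generic Transformer-XL implementation; the field names are illustrative, since the authors do not release their code:

```python
from dataclasses import dataclass

@dataclass
class TransformerXLConfig:
    """Hyperparameters from the experiment-setup entry; field names are assumptions."""
    num_layers: int = 18
    hidden_size: int = 1024
    vocab_size: int = 50_259        # SentencePiece subwords
    sequence_length: int = 1024
    train_memory_length: int = 384  # Transformer-XL memory cache during training
    eval_memory_length: int = 1600  # longer memory cache at test time
    # Reported total: roughly 287M parameters, ~15% smaller than GPT-2 Medium / BERT Large.

config = TransformerXLConfig()
```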