Mind the Gap: Assessing Temporal Generalization in Neural Language Models
Authors: Angeliki Lazaridou, Adhiguna Kuncoro, Elena Gribovskaya, Devang Agrawal, Adam Liska, Tayfun Terzi, Mai Gimenez, Cyprien de Masson d'Autume, Tomas Kocisky, Sebastian Ruder, Dani Yogatama, Kris Cao, Susannah Young, Phil Blunsom
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To what extent does the current static language modelling practice overestimate performance, compared to the more realistic setup that evaluates LMs on future utterances? To this end, we introduce our dynamic, streaming language modelling benchmarks (§2), and find that Transformer-XLs (Dai et al., 2019) perform up to 16% worse when predicting articles that are published up to 2 years after the end of the training period. Moreover, model performance becomes increasingly worse with time (§3). We perform our experiments on autoregressive, left-to-right LMs. (See the evaluation-gap sketch below the table.) |
| Researcher Affiliation | Industry | Angeliki Lazaridou, Adhiguna Kuncoro, Elena Gribovskaya, Devang Agrawal, Adam Liška, Tayfun Terzi, Mai Gimenez, Cyprien de Masson d'Autume, Tomas Kocisky, Sebastian Ruder, Dani Yogatama, Kris Cao, Susannah Young, Phil Blunsom — DeepMind, London, UK {angeliki,akuncoro,egribovskaya}@deepmind.com |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | While we do not release the code, the experiment can be repeated with publicly available Transformer implementations. |
| Open Datasets | Yes | For the scientific domain, we use the publicly available arXiv abstracts (ARXIV). For news, we use the publicly available WMT News Crawl (WMT). Footnote 5: arXiv: https://arxiv.org/help/oa/index; WMT News: http://data.statmt.org/news-crawl; and SacreMoses: https://github.com/alvations/sacremoses. |
| Dataset Splits | Yes | Here we use all documents from the beginning of each dataset's time period up until September 2017 as training data, and use the last three months of 2017 as our validation period; we denote this as the TIME-STRATIFIED setup. We sample a similarly-sized validation set as the TIME-STRATIFIED setup, which in this case comes from the 2018-2019 evaluation period (again excluding the test documents). (See the split sketch below the table.) |
| Hardware Specification | Yes | To train and evaluate the models, including hyperparameter optimization, we used approximately 186,000 TPU hours. In each experiment, we used 32 TPUs for training and 1 TPU for evaluation. |
| Software Dependencies | No | The paper mentions 'SentencePiece' and 'Moses' for tokenization, but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | We use a Transformer-XL (Dai et al., 2019) with 18 layers and 1,024 hidden units, resulting in 287M parameters, roughly 15% smaller than GPT-2 MEDIUM and BERT LARGE; we later explore larger models in §4. We set the Transformer sequence length to 1,024, and set the memory cache length to 384 during training and 1,600 during test. We use a vocabulary of 50,259 subwords, obtained via SentencePiece (Kudo and Richardson, 2018) trained on a random subset (up to 15GB) of the training data of each respective experiment, i.e., CONTROL and TIME-STRATIFIED. (See the configuration sketch below the table.) |
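
The Dataset Splits row describes a chronological, TIME-STRATIFIED partition of each corpus. Below is a minimal sketch of how such a split could be constructed; the document records, the field names (`date`, `text`), and the exact day-level cut-off dates are assumptions for illustration, not the paper's code.

```python
from datetime import date

def time_stratified_split(documents,
                          train_end=date(2017, 9, 30),
                          valid_end=date(2017, 12, 31)):
    """Chronological split: train on everything up to September 2017,
    validate on the last three months of 2017, and hold out later
    documents (2018-2019) as the 'future' evaluation period."""
    train, valid, future = [], [], []
    for doc in documents:  # each doc is assumed to be {"date": date, "text": str}
        if doc["date"] <= train_end:
            train.append(doc)
        elif doc["date"] <= valid_end:
            valid.append(doc)
        else:
            future.append(doc)
    return train, valid, future

# Example usage with toy records.
docs = [
    {"date": date(2016, 5, 1), "text": "older article"},
    {"date": date(2017, 11, 2), "text": "validation-period article"},
    {"date": date(2019, 3, 7), "text": "future article"},
]
train, valid, future = time_stratified_split(docs)
```

The Experiment Setup row lists the reported Transformer-XL hyperparameters and the SentencePiece vocabulary size. The sketch below collects those numbers into a plain config and shows a standard SentencePiece training call; the file paths, output prefix, and any trainer options not stated in the paper are hypothetical.

```python
import sentencepiece as spm

# Hyperparameters as reported in the paper; paths and output names below
# are assumptions.
TRANSFORMER_XL_CONFIG = {
    "num_layers": 18,
    "hidden_size": 1024,
    "sequence_length": 1024,
    "memory_length_train": 384,
    "memory_length_eval": 1600,
    "vocab_size": 50_259,          # subword vocabulary size
    "approx_params": 287_000_000,  # roughly 15% smaller than GPT-2 MEDIUM / BERT LARGE
}

# Train the subword vocabulary on a random subset (up to 15GB) of the
# training data of the respective experiment (CONTROL or TIME-STRATIFIED).
spm.SentencePieceTrainer.train(
    input="train_subset.txt",   # hypothetical path to the sampled training text
    model_prefix="lm_vocab",    # hypothetical output prefix
    vocab_size=TRANSFORMER_XL_CONFIG["vocab_size"],
)
```

The Research Type row quotes a degradation of up to 16% when models predict future articles. The helper below shows one plausible way to compute such a relative perplexity gap between the CONTROL and TIME-STRATIFIED models scored on the same future test set; the function names and the exact metric formulation are assumptions, not taken from the paper.

```python
import math

def perplexity(sum_log_prob, num_tokens):
    """Corpus perplexity from a summed natural-log likelihood."""
    return math.exp(-sum_log_prob / num_tokens)

def relative_perplexity_increase(ppl_control, ppl_time_stratified):
    """Relative gap of the TIME-STRATIFIED model over the CONTROL model
    when both are evaluated on the same future test documents."""
    return (ppl_time_stratified - ppl_control) / ppl_control

# Toy numbers only: a control perplexity of 20.0 versus 23.2 for the
# time-stratified model corresponds to roughly a 16% relative increase.
print(relative_perplexity_increase(20.0, 23.2))  # ~0.16
```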