Future Language Modeling from Temporal Document History

Authors: Changmao Li, Jeffrey Flanigan

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that it is indeed possible to build future language models that improve upon strong non-temporal language model baselines, opening the door to working on this important, and widely applicable problem.
Researcher Affiliation | Academia | Changmao Li, Jeffrey Flanigan, University of California, Santa Cruz, {changmao.li,jmflanig}@ucsc.edu
Pseudocode | No | The paper provides architectural diagrams of the models in Figure 2, but these are not structured pseudocode or algorithm blocks.
Open Source Code | Yes | Footnote 1: Our code is available at https://github.com/jlab-nlp/Future-Language-Modeling
Open Datasets | Yes | We first collect paper abstracts for each year from the ACL Anthology website (Footnote 6) and filter the noisy abstracts such as papers that are not in English. Then we use the years as the year (for other domains such as news, you can use the day or hour as the year) and split the paper abstracts by years and use abstracts from 2003-2019 as training data, the year 2020 as the development data, and the year 2021 as the test data. Footnote 6: https://aclanthology.org/anthology+abstracts.bib.gz
Dataset Splits | Yes | Then we use the years as the year (for other domains such as news, you can use the day or hour as the year) and split the paper abstracts by years and use abstracts from 2003-2019 as training data, the year 2020 as the development data, and the year 2021 as the test data. (A sketch of this year-based split follows the table.)
Hardware Specification | Yes | All models were trained or evaluated on either one A40 or A6000 GPU.
Software Dependencies | No | The paper mentions software components like 'GPT-2', 'RoBERTa Model', 'Adam optimizer', and 'Huggingface Transformers' but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | We use the Adam optimizer (Kingma & Ba, 2015). The batch size is set to 2 with gradient accumulation size 2. Between layers, we apply dropout with a probability of 0.1. We fine-tune 10 epochs for each model and do early stopping. The α is set to 1e-3 or initialized with 1 when automatically learned. Bounds for all hyperparameters are the same as GPT-2. We run several hyperparameter search trials on α with values 1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5. For each model, we have three training and evaluation runs. The method of choosing hyperparameters is based on perplexity scores on the dev set. A fine-tuned RoBERTa model (Liu et al., 2019) for each year is used to generate the temporal word embedding representation. We use beam search decoding with top-k sampling. The beam size is 5, k is 50, and p is 0.92. (Sketches of the training and decoding configurations follow the table.)
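
The year-based split quoted in the Open Datasets and Dataset Splits rows can be illustrated with a short sketch. This is a minimal illustration only, assuming the abstracts have already been parsed into (year, text) pairs; the function name is hypothetical and does not come from the authors' released code.

    # Minimal sketch of the year-based split: train on 2003-2019, develop on 2020, test on 2021.
    from collections import defaultdict

    def split_by_year(abstracts):
        """abstracts: iterable of (year, text) pairs parsed from the ACL Anthology dump."""
        by_year = defaultdict(list)
        for year, text in abstracts:
            by_year[year].append(text)

        train = [t for y in range(2003, 2020) for t in by_year[y]]  # 2003-2019 inclusive
        dev = by_year[2020]   # development data
        test = by_year[2021]  # test data
        return train, dev, test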
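
The fine-tuning configuration in the Experiment Setup row (Adam, batch size 2 with gradient accumulation 2, dropout 0.1, 10 epochs, early stopping on dev perplexity) can be sketched as below. This is a hedged illustration that uses a plain GPT-2 model rather than the paper's temporal architecture; the learning rate, sequence length, and data handling are assumptions, not values reported by the authors.

    # Hedged sketch of the reported fine-tuning setup; `train` and `dev` are the
    # lists of abstracts produced by split_by_year above. Plain GPT-2 stands in
    # for the paper's temporal model; lr and max_length are assumptions.
    import math
    import torch
    from torch.utils.data import DataLoader
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token by default
    model = GPT2LMHeadModel.from_pretrained("gpt2")    # GPT-2 dropout defaults to 0.1

    def make_loader(texts, batch_size=2, shuffle=False):
        """Tokenize a list of abstracts and wrap them in a DataLoader."""
        enc = tokenizer(texts, truncation=True, padding=True, max_length=512, return_tensors="pt")
        return DataLoader(list(zip(enc["input_ids"], enc["attention_mask"])),
                          batch_size=batch_size, shuffle=shuffle)

    train_loader, dev_loader = make_loader(train, shuffle=True), make_loader(dev)
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)  # lr value is an assumption
    accum_steps, best_ppl = 2, float("inf")

    for epoch in range(10):                            # fine-tune for 10 epochs
        model.train()
        for step, (ids, mask) in enumerate(train_loader):
            labels = ids.masked_fill(mask == 0, -100)  # ignore loss on padding tokens
            loss = model(input_ids=ids, attention_mask=mask, labels=labels).loss
            (loss / accum_steps).backward()            # gradient accumulation of 2
            if (step + 1) % accum_steps == 0:
                optimizer.step()
                optimizer.zero_grad()

        # Early stopping: keep the checkpoint only while dev perplexity improves.
        model.eval()
        with torch.no_grad():
            dev_loss = sum(model(input_ids=i, attention_mask=m,
                                 labels=i.masked_fill(m == 0, -100)).loss.item()
                           for i, m in dev_loader) / len(dev_loader)
        ppl = math.exp(dev_loss)
        if ppl < best_ppl:
            best_ppl = ppl
            model.save_pretrained("best-checkpoint")
        else:
            break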
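
The decoding settings quoted in the Experiment Setup row (beam search with top-k sampling, beam size 5, k = 50, p = 0.92) map onto Huggingface's generate() as in the sketch below; the prompt text and output length are placeholders, not values from the paper.

    # Hedged sketch of the reported decoding configuration using Huggingface generate().
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    inputs = tokenizer("In this paper, we", return_tensors="pt")  # placeholder prompt
    outputs = model.generate(
        **inputs,
        num_beams=5,        # beam size 5
        do_sample=True,     # sampling-based beam search (top-k sampling within beams)
        top_k=50,           # k = 50
        top_p=0.92,         # p = 0.92
        max_length=100,     # placeholder output length
        pad_token_id=tokenizer.eos_token_id,
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))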