Future Language Modeling from Temporal Document History
Authors: Changmao Li, Jeffrey Flanigan
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that it is indeed possible to build future language models that improve upon strong non-temporal language model baselines, opening the door to working on this important, and widely applicable problem. |
| Researcher Affiliation | Academia | Changmao Li, Jeffrey Flanigan; University of California, Santa Cruz; {changmao.li,jmflanig}@ucsc.edu |
| Pseudocode | No | The paper provides architectural diagrams of the models in Figure 2, but these are not structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/jlab-nlp/Future-Language-Modeling |
| Open Datasets | Yes | We first collect paper abstracts for each year from the ACL anthology website and filter the noisy abstracts such as papers that are not in English. Then we use the years as the year (for other domains such as news, you can use the day or hour as the year) and split the paper abstracts by years and use abstracts from 2003-2019 as training data, the year 2020 as the development data, and the year 2021 as the test data. Footnote 6: https://aclanthology.org/anthology+abstracts.bib.gz |
| Dataset Splits | Yes | Then we use the years as the year (for other domains such as news, you can use the day or hour as the year) and split the paper abstracts by years and use abstracts from 2003-2019 as training data, the year 2020 as the development data, and the year 2021 as the test data. (A year-based split sketch follows this table.) |
| Hardware Specification | Yes | All models were trained or evaluated on either one A40 or A6000 GPU. |
| Software Dependencies | No | The paper mentions software components like 'GPT-2', 'RoBERTa Model', 'Adam optimizer', and 'Huggingface Transformers' but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | We use the Adam optimizer (Kingma & Ba, 2015). The batch size is set to 2 with gradient accumulation size 2. Between layers, we apply dropout with a probability of 0.1. We fine-tune 10 epochs for each model and do early stopping. The α is set to 1e-3 or initialized with 1 when automatically learned. Bounds for all hyperparameters are the same as GPT-2. We have several hyperparameter search trials on α which are 1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5. For each model, we have three training and evaluation runs. The method of choosing hyperparameters is based on perplexity scores on the dev set. Fine-tuned RoBERTa Model (Liu et al., 2019) for each year is used to generate temporal word embedding representation. We use beam search decoding with top-k sampling. The beam size is 5, k is 50, and p is 0.92. (Training and decoding sketches follow this table.) |
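The Dataset Splits row describes a purely year-based partition (train 2003-2019, dev 2020, test 2021). Below is a minimal sketch of that split; the `split_by_year` helper and the toy records are illustrative placeholders, not the authors' released preprocessing code, which parses `anthology+abstracts.bib.gz` directly.

```python
# Minimal sketch of the year-based split quoted above (train 2003-2019,
# dev 2020, test 2021). load_abstracts / file handling is out of scope here;
# we assume records of (year, abstract) pairs already parsed from the
# anthology+abstracts.bib.gz dump.
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def split_by_year(records: Iterable[Tuple[int, str]]) -> Dict[str, List[Tuple[int, str]]]:
    """Bucket (year, abstract) pairs into train/dev/test by publication year."""
    splits: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for year, abstract in records:
        if 2003 <= year <= 2019:
            splits["train"].append((year, abstract))
        elif year == 2020:
            splits["dev"].append((year, abstract))
        elif year == 2021:
            splits["test"].append((year, abstract))
        # Years outside 2003-2021 are dropped, mirroring the quoted setup.
    return splits


if __name__ == "__main__":
    toy = [(2003, "First abstract ..."), (2019, "Late training abstract ..."),
           (2020, "Dev abstract ..."), (2021, "Test abstract ...")]
    for name, items in split_by_year(toy).items():
        print(name, len(items))
```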
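The training hyperparameters in the Experiment Setup row (batch size 2, gradient accumulation 2, 10 epochs, early stopping on dev perplexity) map naturally onto Hugging Face `TrainingArguments`, which the paper lists among its tools. This is a hedged sketch under that assumption; the learning rate, patience, output directory, and evaluation strategy are not given in the excerpt and are assumptions, and the excerpt's Adam optimizer differs from the Trainer's default AdamW.

```python
# Sketch of the quoted training setup using Hugging Face TrainingArguments.
# Values not in the excerpt are marked as assumed; the released code may use
# its own training loop instead of Trainer.
from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="future-lm-checkpoints",   # placeholder path (assumed)
    per_device_train_batch_size=2,        # "The batch size is set to 2"
    gradient_accumulation_steps=2,        # "... with gradient accumulation size 2"
    num_train_epochs=10,                  # "We fine-tune 10 epochs"
    evaluation_strategy="epoch",          # needed for early stopping; assumed
    save_strategy="epoch",                # must match evaluation_strategy
    load_best_model_at_end=True,          # keep the best dev-loss checkpoint
    metric_for_best_model="eval_loss",    # dev perplexity = exp(eval_loss)
)

# Early stopping on dev loss, mirroring "do early stopping"; patience is assumed.
early_stopping = EarlyStoppingCallback(early_stopping_patience=2)
```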
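The decoding settings in the same row (beam size 5, top-k 50, top-p 0.92) can be expressed with the standard `generate()` API of Hugging Face Transformers. The base checkpoint (`gpt2`), the prompt, and the length cap below are placeholders; the authors' models condition on temporal representations that this sketch does not include.

```python
# Hedged sketch of the quoted decoding configuration: beam search combined
# with top-k / top-p sampling (num_beams=5, top_k=50, top_p=0.92).
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("In this paper, we propose", return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=5,         # "The beam size is 5"
    do_sample=True,      # beam search with sampling
    top_k=50,            # "k is 50"
    top_p=0.92,          # "p is 0.92"
    max_new_tokens=100,  # length cap not given in the excerpt (assumed)
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```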