Unlimiformer: Long-Range Transformers with Unlimited Length Input
Authors: Amanda Bertsch, Uri Alon, Graham Neubig, Matthew Gormley
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate Unlimiformer on several long-document and book-summarization benchmarks, showing that it can process even 500k token-long inputs from the BookSum dataset, without any input truncation at test time. We demonstrate that Unlimiformer improves pretrained models such as BART (Lewis et al., 2020a) and Longformer (Beltagy et al., 2020) by extending them to unlimited inputs without additional learned weights and without modifying their code. |
| Researcher Affiliation | Academia | Amanda Bertsch, Uri Alon, Graham Neubig, Matthew R. Gormley; Carnegie Mellon University, USA; {abertsch,ualon,gneubig,mgormley}@cs.cmu.edu |
| Pseudocode | No | The paper describes the methods and procedures in narrative text and with mathematical equations, but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and models are publicly available, and support LLaMA-2 as well. https://github.com/abertsch72/unlimiformer. Toward this end, we release our code at https://github.com/abertsch72/unlimiformer. |
| Open Datasets | Yes | We experiment with two long-document and one book-summarization datasets from varying domains. Table 2 summarizes statistics for each dataset. GovReport and SummScreen were taken from the SCROLLS benchmark (Shaham et al., 2022). GovReport (Huang et al., 2021) is a long-document summarization dataset... SummScreen (Chen et al., 2022) is a long-document summarization dataset... BookSum (Kryściński et al., 2021) is a book-summarization dataset... (A hedged dataset-loading sketch follows the table.) |
| Dataset Splits | Yes | Appendix B: We trained on 10,000 randomly selected examples from this version of WikiSum and evaluate on 2,000 randomly sampled examples (1,000 validation, 1,000 test), maintaining the same sample across all experiments. (A reproducible-sampling sketch follows the table.) |
| Hardware Specification | Yes | Appendix D, Computational Cost: We estimate the total GPU time for results presented in this paper did not exceed approximately 116 days of time on a single 48-GB A6000. |
| Software Dependencies | No | The paper states: "Our code is based on Hugging Face Transformers (Wolf et al., 2020)", but it does not specify the version number for Hugging Face Transformers or any other software dependencies. |
| Experiment Setup | Yes | Appendix A, Training details: At training time, we must backpropagate through the operations described above. Thus, the input length is bounded more strictly: the number of tokens in the full input must fit in GPU memory while the model is loaded. For the computationally expensive methods, we train using batch size 1 and truncate the longest inputs (generally, to 16k tokens). At test time, we use the full input without truncation. We train one model per setting, using the hyperparameter settings from SLED (Ivgi et al., 2022) and early stopping. (An illustrative training-configuration sketch follows the table.) |
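
The datasets quoted in the Open Datasets row are publicly distributed. The sketch below is a minimal illustration of loading them with the Hugging Face `datasets` library; the Hub identifiers (`tau/scrolls`, the config names `gov_report` and `summ_screen_fd`, and `kmfoda/booksum`) are assumptions about where these releases are mirrored, not identifiers stated in the paper.

```python
# Hypothetical loading sketch; the Hub IDs and config names are assumptions,
# not identifiers given in the paper.
from datasets import load_dataset

# GovReport and SummScreen as packaged in the SCROLLS benchmark.
gov_report = load_dataset("tau/scrolls", "gov_report")
summ_screen = load_dataset("tau/scrolls", "summ_screen_fd")

# BookSum (book-level summarization), assumed to be mirrored under this ID.
book_sum = load_dataset("kmfoda/booksum")

# Quick sanity check of the available splits and their sizes.
for name, ds in [("GovReport", gov_report), ("SummScreen", summ_screen)]:
    print(name, {split: len(ds[split]) for split in ds})
```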
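The Dataset Splits row describes drawing 10,000 training and 2,000 evaluation examples (1,000 validation, 1,000 test) from WikiSum while keeping the same sample across all experiments. One simple way to reproduce that behavior is to shuffle with a fixed seed before slicing; the sketch below is not the authors' sampling script, and the seed, file path, and field layout are assumptions.

```python
# Hypothetical reproducible subsampling sketch (not the authors' script).
from datasets import load_dataset

SEED = 42  # assumed value; any fixed seed keeps the sample identical across runs

# Assumes WikiSum has been exported locally as JSON lines with one example per row.
wikisum = load_dataset("json", data_files={"train": "wikisum.jsonl"})["train"]

shuffled = wikisum.shuffle(seed=SEED)
train_set = shuffled.select(range(10_000))                # 10,000 training examples
validation_set = shuffled.select(range(10_000, 11_000))   # 1,000 validation examples
test_set = shuffled.select(range(11_000, 12_000))         # 1,000 test examples
```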
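The Experiment Setup row reports training with batch size 1, truncating the longest inputs to roughly 16k tokens at training time only, and early stopping, with the actual hyperparameters taken from SLED. Since the released code builds on Hugging Face Transformers, a training configuration in that spirit might look like the sketch below; the checkpoint name, output path, step counts, patience, and the 16,384-token cap are illustrative assumptions rather than the paper's exact settings.

```python
# Illustrative training-time configuration only; all values are assumptions,
# not the SLED hyperparameters the paper actually uses.
from transformers import (AutoTokenizer, Seq2SeqTrainingArguments,
                          EarlyStoppingCallback)

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")  # assumed base model

def preprocess(example):
    # Truncate only at training time (~16k tokens); at test time the full
    # input is passed to the model without truncation.
    return tokenizer(example["input"], truncation=True, max_length=16_384)

training_args = Seq2SeqTrainingArguments(
    output_dir="unlimiformer-run",       # assumed output path
    per_device_train_batch_size=1,       # batch size 1 for the expensive settings
    evaluation_strategy="steps",
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    load_best_model_at_end=True,         # required for early stopping
    metric_for_best_model="eval_loss",
)

# Would be passed via callbacks=[early_stopping] to a Seq2SeqTrainer.
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)  # assumed patience
```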