Unlimiformer: Long-Range Transformers with Unlimited Length Input

Authors: Amanda Bertsch, Uri Alon, Graham Neubig, Matthew Gormley

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate Unlimiformer on several long-document and book-summarization benchmarks, showing that it can process even 500k token-long inputs from the BookSum dataset, without any input truncation at test time. We demonstrate that Unlimiformer improves pretrained models such as BART (Lewis et al., 2020a) and Longformer (Beltagy et al., 2020) by extending them to unlimited inputs without additional learned weights and without modifying their code.
Researcher Affiliation | Academia | Amanda Bertsch, Uri Alon, Graham Neubig, Matthew R. Gormley; Carnegie Mellon University, USA; {abertsch,ualon,gneubig,mgormley}@cs.cmu.edu
Pseudocode | No | The paper describes the methods and procedures in narrative text and with mathematical equations, but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code and models are publicly available, and support LLaMA-2 as well. https://github.com/abertsch72/unlimiformer. Toward this end, we release our code at https://github.com/abertsch72/unlimiformer.
Open Datasets | Yes | We experiment with two long-document and one book-summarization datasets from varying domains. Table 2 summarizes statistics for each dataset. GovReport and SummScreen were taken from the SCROLLS benchmark (Shaham et al., 2022). GovReport (Huang et al., 2021) is a long-document summarization dataset... SummScreen (Chen et al., 2022) is a long-document summarization dataset... BookSum (Kryściński et al., 2021) is a book-summarization dataset...
Dataset Splits | Yes | Appendix B: We trained on 10,000 randomly selected examples from this version of WikiSum and evaluate on 2,000 randomly sampled examples (1,000 validation, 1,000 test), maintaining the same sample across all experiments.
Hardware Specification | Yes | Appendix D, Computational Cost: We estimate the total GPU time for results presented in this paper did not exceed approximately 116 days of time on a single 48-GB A6000.
Software Dependencies | No | The paper states: "Our code is based on Hugging Face Transformers (Wolf et al., 2020)", but it does not specify the version number for Hugging Face Transformers or any other software dependencies.
Experiment Setup | Yes | Appendix A, Training details: At training time, we must backpropagate through the operations described above. Thus, the input length is bounded more strictly: the number of tokens in the full input must fit in GPU memory while the model is loaded. For the computationally expensive methods, we train using batch size 1 and truncate the longest inputs (generally, to 16k tokens). At test time, we use the full input without truncation. We train one model per setting, using the hyperparameter settings from SLED (Ivgi et al., 2022) and early stopping.
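
The three summarization corpora named in the Open Datasets row are all publicly downloadable. Below is a minimal sketch of loading them with the Hugging Face `datasets` library; the hub identifiers (the SCROLLS configs under `tau/scrolls` and the community BookSum mirror `kmfoda/booksum`) are assumptions about where public copies live, not identifiers taken from the paper, and recent `datasets` versions may additionally require `trust_remote_code=True` for the SCROLLS loading script.

    from datasets import load_dataset

    # SCROLLS-packaged versions of GovReport and SummScreen (assumed hub IDs)
    gov_report = load_dataset("tau/scrolls", "gov_report")
    summ_screen = load_dataset("tau/scrolls", "summ_screen_fd")

    # Community mirror of BookSum (assumed hub ID; not the authors' own release)
    book_sum = load_dataset("kmfoda/booksum")

    # Inspect the split sizes of one of the corpora
    print({split: len(ds) for split, ds in gov_report.items()})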
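The Dataset Splits row quotes a fixed subsample of WikiSum: 10,000 training, 1,000 validation, and 1,000 test examples, reused across all experiments. A hedged sketch of that kind of seeded subsampling, assuming the data sits in a Hugging Face `Dataset` and using an arbitrary seed of 42 (the paper does not report the seed):

    from datasets import Dataset

    def subsample_wikisum(full_dataset: Dataset, seed: int = 42):
        # Shuffling with a fixed seed keeps the same sample across all experiments.
        shuffled = full_dataset.shuffle(seed=seed)
        train = shuffled.select(range(10_000))
        validation = shuffled.select(range(10_000, 11_000))
        test = shuffled.select(range(11_000, 12_000))
        return train, validation, test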
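For scale, the compute budget quoted in the Hardware Specification row (about 116 days on a single 48-GB A6000) corresponds to roughly 116 × 24 ≈ 2,800 A6000 GPU-hours.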
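The Experiment Setup row describes truncating long inputs (generally to 16k tokens) with batch size 1 at training time, while feeding the full, untruncated input at test time. A minimal sketch of that tokenization policy, assuming a BART-style tokenizer from Hugging Face Transformers (the checkpoint name and the 16,384-token cap are illustrative choices, not values confirmed by the paper):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")

    def encode(document: str, training: bool):
        if training:
            # Bounded at training time so the full input fits in GPU memory
            # while backpropagating (the paper generally truncates to ~16k tokens).
            return tokenizer(document, truncation=True, max_length=16_384,
                             return_tensors="pt")
        # At test time the untruncated input is used.
        return tokenizer(document, truncation=False, return_tensors="pt")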