Landmark Attention: Random-Access Infinite Context Length for Transformers
Authors: Amirkeivan Mohtashami, Martin Jaggi
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that our method can obtain comparable performance with Transformer-XL while significantly reducing the number of retrieved tokens in each step. Finally, we show that fine-tuning LLaMA 7B with our method successfully extends its context length capacity to over 32k tokens, allowing for inference at the context lengths of GPT-4. We release the implementation of landmark attention and the code to reproduce our experiments at https://github.com/epfml/landmark-attention/. |
| Researcher Affiliation | Academia | Amirkeivan Mohtashami EPFL amirkeivan.mohtashami@epfl.ch Martin Jaggi EPFL martin.jaggi@epfl.ch |
| Pseudocode | No | The paper describes its methodology in text and equations but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | We release the implementation of landmark attention and the code to reproduce our experiments at https://github.com/epfml/landmark-attention/. |
| Open Datasets | Yes | We first evaluate the efficacy of retrieving earlier blocks on two language modeling tasks which can be expected to have long-range token interactions: English language books (PG-19) [29] (3.7B tokens), and math papers from arXiv (5.6B tokens). We provide additional details about the datasets in Appendix B. |
| Dataset Splits | No (see the segmentation sketch after the table) | To evaluate our model's performance with different context lengths, we divide the validation data into equally sized segments, referred to as evaluation lengths. |
| Hardware Specification | Yes | We used mixed-precision training with bfloat16 over at most 4 Nvidia A100 GPUs. |
| Software Dependencies | No | The paper mentions software components like 'AdamW', 'GPT-2 tokenizer', and 'Triton', but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes (see the training-setup sketch after the table) | We trained our model using AdamW [22] with β1 = 0.9 and β2 = 0.95. We applied weight decay with factor 0.001. We used base learning rate 0.002 for all our experiments with a warmup stage that was 2% of the whole training and applied a cosine scheduler with minimum (final) learning rate being 0.0004. ... We used gradient accumulation as well as data-parallel training across four nodes to maintain an effective total batch size of 128. ... For our method, we train the model on each dataset for 240K steps with context length ℓseq = 512. |
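The 'Dataset Splits' row above quotes the paper's evaluation practice of dividing validation data into equally sized segments ("evaluation lengths"). The sketch below illustrates what such a segmentation could look like; the function name `split_into_eval_segments`, the flat token-ID input, and the handling of the trailing remainder are illustrative assumptions rather than code from the authors' repository.

```python
# Illustrative sketch (not from the authors' repository): splitting a stream of
# validation token IDs into consecutive, equally sized segments ("evaluation
# lengths") so that perplexity can be measured separately per context length.
from typing import List


def split_into_eval_segments(token_ids: List[int], eval_length: int) -> List[List[int]]:
    """Chop a flat token stream into non-overlapping segments of `eval_length` tokens.
    A trailing remainder shorter than `eval_length` is dropped here for simplicity
    (an assumption, not necessarily the paper's exact rule)."""
    n_full = len(token_ids) // eval_length
    return [token_ids[i * eval_length:(i + 1) * eval_length] for i in range(n_full)]


if __name__ == "__main__":
    stream = list(range(10_000))  # stand-in for tokenized validation data
    for eval_length in (512, 2048, 4096):
        segments = split_into_eval_segments(stream, eval_length)
        print(f"eval length {eval_length}: {len(segments)} segments")
```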
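The 'Experiment Setup' and 'Hardware Specification' rows report the optimizer and schedule hyperparameters verbatim. The PyTorch sketch below simply wires those reported values together (AdamW with β1 = 0.9, β2 = 0.95, weight decay 0.001, 2% warmup, cosine decay from 0.002 to 0.0004, bfloat16 autocast, gradient accumulation toward an effective batch size of 128). The model, the random micro-batches, the accumulation factor, and the omission of data-parallel wrapping are placeholders and assumptions; the authors' actual training loop may differ.

```python
# Illustrative PyTorch sketch of the reported optimization setup; numeric
# hyperparameters come from the table above, everything else is a placeholder.
import math
import torch

TOTAL_STEPS  = 240_000                      # reported training steps per dataset
BASE_LR      = 2e-3                         # reported base learning rate
MIN_LR       = 4e-4                         # reported minimum (final) learning rate
WARMUP_STEPS = int(0.02 * TOTAL_STEPS)      # warmup is 2% of training
ACCUM_STEPS  = 8                            # assumption: per-device batch * devices *
                                            # ACCUM_STEPS = 128 (effective batch size)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 512).to(device)          # placeholder for the language model
optimizer = torch.optim.AdamW(model.parameters(), lr=BASE_LR,
                              betas=(0.9, 0.95), weight_decay=1e-3)


def lr_lambda(step: int) -> float:
    """Linear warmup for 2% of training, then cosine decay from BASE_LR to MIN_LR."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return (MIN_LR + (BASE_LR - MIN_LR) * cosine) / BASE_LR


scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(TOTAL_STEPS):
    for micro_step in range(ACCUM_STEPS):
        x = torch.randn(16, 512, device=device)       # placeholder micro-batch
        with torch.autocast(device_type=device, dtype=torch.bfloat16):
            loss = model(x).pow(2).mean()              # placeholder loss
        (loss / ACCUM_STEPS).backward()                # gradient accumulation
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad(set_to_none=True)
```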