Landmark Attention: Random-Access Infinite Context Length for Transformers

Authors: Amirkeivan Mohtashami, Martin Jaggi

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that our method can obtain comparable performance with Transformer-XL while significantly reducing the number of retrieved tokens in each step. Finally, we show that fine-tuning LLaMA 7B with our method successfully extends its context length capacity to over 32k tokens, allowing for inference at the context lengths of GPT-4. We release the implementation of landmark attention and the code to reproduce our experiments at https://github.com/epfml/landmark-attention/.
Researcher Affiliation | Academia | Amirkeivan Mohtashami, EPFL (amirkeivan.mohtashami@epfl.ch); Martin Jaggi, EPFL (martin.jaggi@epfl.ch)
Pseudocode | No | The paper describes its methodology in text and equations but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | We release the implementation of landmark attention and the code to reproduce our experiments at https://github.com/epfml/landmark-attention/.
Open Datasets | Yes | We first evaluate the efficacy of retrieving earlier blocks on two language modeling tasks which can be expected to have long-range token interactions: English language books (PG-19) [29] (3.7B tokens), and math papers from arXiv (5.6B tokens). We provide additional details about the datasets in Appendix B.
Dataset Splits | No | To evaluate our model's performance with different context lengths, we divide the validation data into equally sized segments, referred to as evaluation lengths. (See the segmentation sketch below the table.)
Hardware Specification | Yes | We used mixed-precision training with bfloat16 over at most 4 Nvidia A100 GPUs.
Software Dependencies | No | The paper mentions software components like 'AdamW', 'GPT-2 tokenizer', and 'Triton', but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | We trained our model using AdamW [22] with β1 = 0.9 and β2 = 0.95. We applied weight decay with factor 0.001. We used base learning rate 0.002 for all our experiments with a warmup stage that was 2% of the whole training and applied a cosine scheduler with minimum (final) learning rate being 0.0004. ... We used gradient accumulation as well as data-parallel training across four nodes to maintain an effective total batch size of 128. ... For our method, we train the model on each dataset for 240K steps with context length ℓseq = 512. (See the optimizer/scheduler sketch below the table.)
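The Dataset Splits row above quotes the paper's description of dividing the validation data into equally sized segments ("evaluation lengths"). The following is a minimal sketch of one way such a split could be implemented; the function name, the remainder-dropping behavior, and the toy segment length are illustrative assumptions, not taken from the paper or its released code.

```python
# Hypothetical sketch: split a tokenized validation stream into equal-length
# segments ("evaluation lengths"); names and lengths are illustrative only.
from typing import List


def split_into_segments(token_ids: List[int], eval_length: int) -> List[List[int]]:
    """Divide a flat token sequence into consecutive, equally sized segments.

    Trailing tokens that do not fill a whole segment are dropped here, which is
    one plausible reading of "equally sized segments"; the paper does not state
    how remainders are handled.
    """
    num_segments = len(token_ids) // eval_length
    return [
        token_ids[i * eval_length : (i + 1) * eval_length]
        for i in range(num_segments)
    ]


if __name__ == "__main__":
    # Example: a toy "validation set" of 10 tokens split at evaluation length 4.
    toy_tokens = list(range(10))
    print(split_into_segments(toy_tokens, eval_length=4))
    # -> [[0, 1, 2, 3], [4, 5, 6, 7]]
```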
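The Experiment Setup row quotes the optimizer and schedule hyperparameters. Below is a hedged PyTorch sketch wiring those reported values (AdamW with β1 = 0.9, β2 = 0.95, weight decay 0.001, base learning rate 0.002, a 2% warmup stage, and cosine decay to a final learning rate of 0.0004) into a schedule. The placeholder model, the linear-warmup shape, and the LambdaLR-based implementation are assumptions for illustration, not the authors' released training code.

```python
# Hedged sketch of the reported training hyperparameters in PyTorch.
# Only the hyperparameter values (betas, weight decay, base/final LR, 2% warmup,
# 240K steps) come from the quoted setup; everything else is an assumption.
import math

import torch

model = torch.nn.Linear(512, 512)  # placeholder for the actual Transformer

total_steps = 240_000                    # "240K steps" from the quoted setup
warmup_steps = int(0.02 * total_steps)   # warmup stage = 2% of training
base_lr, min_lr = 0.002, 0.0004

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=base_lr,
    betas=(0.9, 0.95),
    weight_decay=0.001,
)


def lr_lambda(step: int) -> float:
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    # LambdaLR multiplies base_lr by this factor, so divide by base_lr here.
    return (min_lr + (base_lr - min_lr) * cosine) / base_lr


scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```

In a training loop, `scheduler.step()` would be called once per optimizer update; the quoted setup additionally relies on gradient accumulation and data-parallel training across four nodes to reach an effective batch size of 128, which this sketch omits.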