Landmark Attention: Random-Access Infinite Context Length for Transformers

Authors: Amirkeivan Mohtashami, Martin Jaggi

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that our method can obtain comparable performance with Transformer-XL while significantly reducing the number of retrieved tokens in each step. Finally, we show that fine-tuning LLaMA 7B with our method successfully extends its context length capacity to over 32k tokens, allowing for inference at the context lengths of GPT-4. We release the implementation of landmark attention and the code to reproduce our experiments at https://github.com/epfml/landmark-attention/.
Researcher Affiliation | Academia | Amirkeivan Mohtashami, EPFL (amirkeivan.mohtashami@epfl.ch); Martin Jaggi, EPFL (martin.jaggi@epfl.ch)
Pseudocode | No | The paper describes its methodology in text and equations but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | We release the implementation of landmark attention and the code to reproduce our experiments at https://github.com/epfml/landmark-attention/.
Open Datasets | Yes | We first evaluate the efficacy of retrieving earlier blocks on two language modeling tasks which can be expected to have long-range token interactions: English language books (PG-19) [29] (3.7B tokens), and math papers from arXiv (5.6B tokens). We provide additional details about the datasets in Appendix B.
Dataset Splits | No | To evaluate our model's performance with different context lengths, we divide the validation data into equally sized segments, referred to as evaluation lengths. (See the segmentation sketch below the table.)
Hardware Specification | Yes | We used mixed-precision training with bfloat16 over at most 4 Nvidia A100 GPUs.
Software Dependencies | No | The paper mentions software components like 'AdamW', 'GPT-2 tokenizer', and 'Triton', but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | We trained our model using AdamW [22] with β1 = 0.9 and β2 = 0.95. We applied weight decay with factor 0.001. We used base learning rate 0.002 for all our experiments with a warmup stage that was 2% of the whole training and applied a cosine scheduler with minimum (final) learning rate being 0.0004. ... We used gradient accumulation as well as data-parallel training across four nodes to maintain an effective total batch size of 128. ... For our method, we train the model on each dataset for 240K steps with context length ℓseq = 512. (See the optimizer/scheduler sketch below the table.)
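The Dataset Splits row above quotes the paper's description of dividing the validation data into equally sized segments ("evaluation lengths"). The following is a minimal sketch of one way such a split could be implemented; the function name, the remainder-dropping behavior, and the toy segment length are illustrative assumptions, not taken from the paper or its released code.

```python
# Hypothetical sketch: split a tokenized validation stream into equal-length
# segments ("evaluation lengths"); names and lengths are illustrative only.
from typing import List


def split_into_segments(token_ids: List[int], eval_length: int) -> List[List[int]]:
    """Divide a flat token sequence into consecutive, equally sized segments.

    Trailing tokens that do not fill a whole segment are dropped here, which is
    one plausible reading of "equally sized segments"; the paper does not state
    how remainders are handled.
    """
    num_segments = len(token_ids) // eval_length
    return [
        token_ids[i * eval_length : (i + 1) * eval_length]
        for i in range(num_segments)
    ]


if __name__ == "__main__":
    # Example: a toy "validation set" of 10 tokens split at evaluation length 4.
    toy_tokens = list(range(10))
    print(split_into_segments(toy_tokens, eval_length=4))
    # -> [[0, 1, 2, 3], [4, 5, 6, 7]]
```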
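The Experiment Setup row quotes the optimizer and schedule hyperparameters. Below is a hedged PyTorch sketch wiring those reported values (AdamW with β1 = 0.9, β2 = 0.95, weight decay 0.001, base learning rate 0.002, a 2% warmup stage, and cosine decay to a final learning rate of 0.0004) into a schedule. The placeholder model, the linear-warmup shape, and the LambdaLR-based implementation are assumptions for illustration, not the authors' released training code.

```python
# Hedged sketch of the reported training hyperparameters in PyTorch.
# Only the hyperparameter values (betas, weight decay, base/final LR, 2% warmup,
# 240K steps) come from the quoted setup; everything else is an assumption.
import math

import torch

model = torch.nn.Linear(512, 512)  # placeholder for the actual Transformer

total_steps = 240_000                    # "240K steps" from the quoted setup
warmup_steps = int(0.02 * total_steps)   # warmup stage = 2% of training
base_lr, min_lr = 0.002, 0.0004

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=base_lr,
    betas=(0.9, 0.95),
    weight_decay=0.001,
)


def lr_lambda(step: int) -> float:
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    # LambdaLR multiplies base_lr by this factor, so divide by base_lr here.
    return (min_lr + (base_lr - min_lr) * cosine) / base_lr


scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```

In a training loop, `scheduler.step()` would be called once per optimizer update; the quoted setup additionally relies on gradient accumulation and data-parallel training across four nodes to reach an effective batch size of 128, which this sketch omits.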