YaRN: Efficient Context Window Extension of Large Language Models

Authors: Bowen Peng, Jeffrey Quesnelle, Honglu Fan, Enrico Shippole

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate long-sequence language modeling performance, we use the GovReport (Huang et al., 2021) and Proof-pile (Azerbayev et al., 2022) datasets, both of which contain many long sequence samples. For all evaluations, the test splits of both datasets were used exclusively. All perplexity evaluations were calculated using the sliding window method from Press et al. (2022) with S = 256, which takes into account the entire document's perplexity contribution, even if the context window of the model is shorter. Table 1 shows the long-sequence performance of the fine-tuned Llama 2 s = 16 and s = 32 models. We demonstrate that YaRN is able to generalize and extrapolate to unseen context lengths and benefit from transfer learning, since the s = 32 model was only further trained for 200 steps from the s = 16 checkpoint on 64k data and is able to extrapolate to 128k context. [A sketch of this sliding-window evaluation follows the table.]
Researcher Affiliation | Collaboration | Bowen Peng (Nous Research), Jeffrey Quesnelle (Nous Research), Honglu Fan (EleutherAI; University of Geneva), Enrico Shippole
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | To aid in reproducibility, we provide, as supplementary material, the entirety of the code used to train the YaRN models in Table 7, as well as the evaluation code that produced Figure 7 and Tables 6, 7, 10, 8, and 9. The code also contains implementations of various extension methods referenced throughout the paper.
Open Datasets | Yes | For the s = 16 model, we fine-tuned for 400 steps with global batch size 64 using PyTorch (Paszke et al., 2019) Fully Sharded Data Parallelism (Zhao et al., 2023) and Flash Attention 2 (Dao, 2023) on the PG19 dataset (Rae et al., 2020), chunked into 64k segments bookended with the BOS and EOS tokens.
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits or mention the use of a validation set during training with specific percentages or sample counts.
Hardware Specification | Yes | Table 4: Comparison of training time in A100-hours for different open and closed models using different extension methods.
Software Dependencies | No | The paper mentions software such as PyTorch and Flash Attention 2, but does not provide specific version numbers for these or other software components, which a reproducible description requires.
Experiment Setup | Yes | We used a learning rate of 2 × 10⁻⁵ with no weight decay and a linear warmup of 20 steps, along with AdamW (Loshchilov and Hutter, 2019) with β1 = 0.9 and β2 = 0.95. For the s = 16 model, we fine-tuned for 400 steps with global batch size 64 using PyTorch (Paszke et al., 2019) Fully Sharded Data Parallelism (Zhao et al., 2023) and Flash Attention 2 (Dao, 2023) on the PG19 dataset (Rae et al., 2020), chunked into 64k segments bookended with the BOS and EOS tokens. [A sketch of this data chunking and optimizer configuration follows the table.]
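
The evaluation quoted under Research Type uses the sliding-window perplexity of Press et al. (2022) with stride S = 256, so that every token of a long document is scored exactly once even when the document exceeds the model's context window. The sketch below is a minimal illustration of that protocol, assuming a Hugging Face-style causal language model whose forward pass accepts `labels` and returns a mean cross-entropy `loss`; the function name and the `context_len` default are illustrative, not taken from the paper.

```python
import torch

def sliding_window_perplexity(model, input_ids, context_len=65536, stride=256):
    """Sliding-window perplexity (Press et al., 2022 style), stride S = 256.

    input_ids: LongTensor of shape [1, seq_len] holding one tokenized document.
    Only the tokens that have not yet been scored contribute to each window's
    loss, so the whole document is counted exactly once.
    """
    device = next(model.parameters()).device
    seq_len = input_ids.size(1)
    nll_sum, counted, prev_end = 0.0, 0, 0
    for begin in range(0, seq_len, stride):
        end = min(begin + context_len, seq_len)
        new_tokens = end - prev_end              # tokens not scored in earlier windows
        window = input_ids[:, begin:end].to(device)
        labels = window.clone()
        labels[:, :-new_tokens] = -100           # mask context that was already scored
        with torch.no_grad():
            loss = model(window, labels=labels).loss   # mean NLL over unmasked labels
        nll_sum += loss.item() * new_tokens
        counted += new_tokens
        prev_end = end
        if end == seq_len:
            break
    return float(torch.exp(torch.tensor(nll_sum / counted)))
```

As in the common Hugging Face perplexity recipe, the per-window mean loss is rescaled by the number of newly scored tokens before averaging, which is a close approximation rather than an exact token-level accounting at window boundaries.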
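
The fine-tuning recipe quoted under Open Datasets and Experiment Setup fixes the data chunking (PG19 split into 64k segments bookended with BOS and EOS) and the optimizer settings (AdamW, learning rate 2 × 10⁻⁵, β1 = 0.9, β2 = 0.95, no weight decay, 20-step linear warmup). The sketch below restates those choices in PyTorch; the Llama BOS/EOS token ids (1 and 2), the handling of a trailing partial segment, and the constant post-warmup learning rate are assumptions not stated in the paper, and the FSDP and Flash Attention 2 wiring is omitted.

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def chunk_into_segments(token_ids, seg_len=65536, bos_id=1, eos_id=2):
    """Chunk a flat stream of token ids into fixed-length segments, bookending
    each segment with BOS and EOS. bos_id/eos_id default to Llama's ids (an
    assumption); the trailing partial segment is dropped (also an assumption)."""
    body = seg_len - 2                           # reserve two slots for BOS and EOS
    segments = []
    for start in range(0, len(token_ids) - body + 1, body):
        segments.append([bos_id] + token_ids[start:start + body] + [eos_id])
    return segments

def build_optimizer(model, lr=2e-5, warmup_steps=20):
    """AdamW with betas (0.9, 0.95) and no weight decay, plus a linear warmup
    over 20 steps; the learning rate is held constant afterwards (assumption)."""
    optimizer = AdamW(model.parameters(), lr=lr, betas=(0.9, 0.95), weight_decay=0.0)
    scheduler = LambdaLR(optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))
    return optimizer, scheduler
```

In a training loop, `scheduler.step()` would be called once per optimizer step so that the learning rate ramps linearly to 2 × 10⁻⁵ over the first 20 steps, matching the warmup described in the quoted setup.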