Base of RoPE Bounds Context Length

Authors: Mingyu Xu, Xin Men, Bingning Wang, Qingyu Zhang, Hongyu Lin, Yaojie Lu, Xianpei Han, Weipeng Chen

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To verify our theory, we conducted thorough experiments on various LLMs such as Llama2-7B [17], Baichuan2-7B [8] and a 2-billion model we trained from scratch, demonstrating that this lower bound holds not only in the fine-tuning stage but also in the pre-training stage.
Researcher Affiliation | Collaboration | Mingyu Xu¹, Xin Men¹, Bingning Wang¹, Qingyu Zhang², Hongyu Lin², Yaojie Lu², Xianpei Han² and Weipeng Chen¹. ¹Baichuan Inc.; ²Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences.
Pseudocode | Yes | The paper provides Python code for calculating the lower-bound base for a context length of 32k:

    """Calculate the lower-bound base for a context length of 32k (2**15)."""
    import torch
    import numpy as np

    def get_BMtheta_expectation(base, context_size=2**15, dim=128):
        # RoPE frequencies theta_i = base^(-2i/dim), i = 0, ..., dim/2 - 1
        realdim = dim // 2
        d = torch.arange(0, realdim, 1)
        theta = base ** (-2 * d / dim)
        # Average cos(m * theta_i) over the dim/2 frequencies for every
        # relative distance m in [0, context_size).
        dist = torch.outer(torch.arange(0, context_size).float(), theta).cos()
        return dist.sum(dim=1) / realdim

    # Grid search over candidate bases of the form (i + j/10) * 10^x.
    search_base = []
    for x in range(3, 10):
        for i in range(1, 10):
            for j in range(10):
                search_base.append((i + j / 10) * (10 ** x))

    for base in search_base:
        ans = get_BMtheta_expectation(base)
        if True not in (ans < 0):
            print("Find! Base =", base)
            break
        idx = np.argmax((ans < 0).numpy())
        print(base, "first zero position:", idx)
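Read literally, get_BMtheta_expectation evaluates, for every relative distance m < 2^15, the dimension-averaged cosine sum (the notation below mirrors the code and may differ slightly from the paper's exact symbols):

    \frac{2}{d}\sum_{i=0}^{d/2-1}\cos\!\left(m\,\theta_i\right), \qquad \theta_i = \mathrm{base}^{-2i/d}, \quad d = 128.

The grid search reports the smallest candidate base for which this quantity stays non-negative over the whole 32k window; for bases that fail, it prints the first position m at which the average turns negative. This reading is inferred from the code itself, not quoted from the paper.
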
Open Source Code | No | Due to some privacy policies and the complexity of training large language models, we are unable to provide the data and code. But we believe that based on the open-source LLMs and open-source code, our results can be reproduced.
Open Datasets | Yes | The dataset we used is a subset of Red Pajama [24].
Dataset Splits | No | The paper mentions using a subset of Red Pajama for training and fine-tuning, and the test split of PG19 for perplexity evaluation. However, it does not provide explicit train/validation/test splits (e.g., percentages or sample counts) for the Red Pajama subset used in the primary experiments, nor does it describe how any such splits were created or used, beyond citing the PG19 test split for evaluation.
Hardware Specification | Yes | All experiments are conducted on a cluster of 16 machines with 128 NVIDIA A100 80G GPUs.
Software Dependencies | No | The paper mentions software such as Flash Attention-2 [40] and Megatron-LM [44] but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | For fine-tuning, we utilized a fixed learning rate of 2e-5 and a global batch size of 128, fine-tuning for 1000 steps. For pre-training, we trained a Llama-like 2B model from scratch on a total of 1 trillion tokens, with the learning rate set to 1e-4 and a cosine decay schedule. Table 4 in Appendix B ("Training hyper-parameters in our experiments") details the "Training length", "Training tokens", "Batchsize", "Base LR", "LR decay" and "Weight decay" for each model.
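As a rough, hedged illustration of the two regimes described above, the sketch below wires up the reported learning rates and schedules in PyTorch. The choice of AdamW, the absence of warmup, and the total_steps argument for pre-training are assumptions, not details given in the paper; only the learning rates, the constant-vs-cosine schedules and the batch/step counts come from the quoted setup.

    # Minimal sketch of the two reported training regimes (assumptions noted above).
    import torch
    from torch.optim.lr_scheduler import CosineAnnealingLR, LambdaLR

    def finetune_setup(model):
        # Long-context fine-tuning: fixed LR 2e-5, global batch size 128, 1000 steps.
        opt = torch.optim.AdamW(model.parameters(), lr=2e-5)
        sched = LambdaLR(opt, lr_lambda=lambda step: 1.0)  # constant learning rate
        return opt, sched

    def pretrain_setup(model, total_steps):
        # Pre-training the Llama-like 2B model: base LR 1e-4 with cosine decay,
        # trained on ~1T tokens (total_steps depends on sequence length and batch size).
        opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
        sched = CosineAnnealingLR(opt, T_max=total_steps)
        return opt, sched

    # Example usage: opt, sched = finetune_setup(torch.nn.Linear(8, 8))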