Scaling Laws of RoPE-based Extrapolation

Authors: Xiaoran Liu, Hang Yan, Chenxin An, Xipeng Qiu, Dahua Lin

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "we conduct further experiments on increasing and decreasing the rotary base in Section 2 and subsequently discover that adjusting the rotary base in both directions can contribute to the extrapolation of RoPE-based LLMs. We first conduct the extrapolation experiments with larger bases, based on the experimental setup in Appendix B.1." (a minimal sketch of the rotary-base adjustment appears after the table)
Researcher Affiliation | Collaboration | Shanghai AI Lab; School of Computer Science, Fudan University
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | "In summary, our contributions are as follows, and codes are available at https://github.com/OpenLMLab/scaling-rope."
Open Datasets | Yes | "We fine-tune the models for 1K steps using the next token prediction objective with training data from the Pile (Gao et al., 2021) and compare the tuning performance on the validation set of Books3 subset (Presser, 2020) from the Pile."
Dataset Splits | No | While the paper mentions using a 'validation set of Books3 subset', it does not provide specific dataset split percentages or absolute sample counts for training, validation, and test sets.
Hardware Specification | Yes | "For fine-tuning 7B and 13B models, we use 32 A100 GPUs and adopt ZeRO3 strategies (Rajbhandari et al., 2020)." (a ZeRO-3 config sketch appears after the table)
Software Dependencies | No | The paper mentions using software like AdamW, CoLLiE, OpenCompass, and FlashAttention-2, but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | "For fine-tuning RoPE with different bases, we set the global batch size to 128, tuning the context length to 4K, the same as the training length, and the evaluating context length to 100K. We fine-tune the models for 1K steps using the next token prediction objective... We set the learning rate to 2 × 10⁻⁵ with no warmup. We set the max gradient norm to 2.5 for 7B and 1 for 13B respectively. We set the weight decay to zero." (these hyperparameters are collected in the config sketch after the table)
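
The rotary-base adjustment quoted under Research Type amounts to changing a single scalar in the RoPE frequency computation. Below is a minimal PyTorch sketch of interleaved RoPE with an adjustable base; the helper names (`rope_frequencies`, `apply_rope`) are illustrative and not taken from the authors' scaling-rope repository.

```python
import torch

def rope_frequencies(head_dim: int, seq_len: int, base: float = 10000.0):
    """Cos/sin tables for RoPE; `base` is the rotary base the paper varies."""
    # Per-pair rotation speed theta_i = base^(-2i/d): a larger base slows the
    # rotation of high-index dimensions, a smaller base speeds it up.
    inv_freq = 1.0 / base ** (torch.arange(0, head_dim, 2).float() / head_dim)
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)  # (seq_len, d/2)
    return angles.cos(), angles.sin()

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """Rotate adjacent channel pairs of x with shape (..., seq_len, head_dim)."""
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

# Example: raising the base above the LLaMA default of 10000 (here to 1e6) is
# one of the two directions of adjustment the paper studies.
cos, sin = rope_frequencies(head_dim=128, seq_len=4096, base=1_000_000.0)
q = torch.randn(1, 8, 4096, 128)  # (batch, heads, seq_len, head_dim)
q_rotated = apply_rope(q, cos, sin)
```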
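
The ZeRO3 strategy cited under Hardware Specification corresponds, in DeepSpeed terms, to ZeRO optimization stage 3, which partitions optimizer states, gradients, and parameters across the 32 GPUs. A minimal config sketch follows; only the stage is implied by the paper, and the other fields are assumptions shown for completeness.

```python
# Minimal DeepSpeed-style config enabling ZeRO stage 3. Only "stage": 3 follows
# from the paper's quote; the batch size matches the Experiment Setup row, and
# the precision setting is an assumption (the quote does not state it).
deepspeed_config = {
    "train_batch_size": 128,
    "zero_optimization": {"stage": 3},
    "bf16": {"enabled": True},
}
```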
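
For quick reference, here are the hyperparameters quoted under Experiment Setup gathered into one place. This is a hedged sketch as a plain dictionary; the key names are illustrative, not field names from the authors' training scripts.

```python
# Fine-tuning hyperparameters as quoted in the paper; key names are illustrative.
finetune_config = {
    "global_batch_size": 128,
    "train_context_length": 4096,        # 4K, same as the training length
    "eval_context_length": 100_000,      # 100K evaluation context
    "max_steps": 1000,                   # 1K steps of next-token prediction
    "learning_rate": 2e-5,
    "warmup_steps": 0,                   # no warmup
    "max_grad_norm": {"7B": 2.5, "13B": 1.0},
    "weight_decay": 0.0,
}
```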