Scaling Laws of RoPE-based Extrapolation

Authors: Xiaoran Liu, Hang Yan, Chenxin An, Xipeng Qiu, Dahua Lin

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "we conduct further experiments on increasing and decreasing the rotary base in Section 2 and subsequently discover that adjusting the rotary base in both directions can contribute to the extrapolation of RoPE-based LLMs. We first conduct the extrapolation experiments with larger bases, based on the experimental setup in Appendix B.1." (a minimal sketch of the rotary-base adjustment appears after the table)
Researcher Affiliation | Collaboration | Shanghai AI Lab; School of Computer Science, Fudan University
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | "In summary, our contributions are as follows, and codes are available at https://github.com/OpenLMLab/scaling-rope."
Open Datasets | Yes | "We fine-tune the models for 1K steps using the next token prediction objective with training data from the Pile (Gao et al., 2021) and compare the tuning performance on the validation set of Books3 subset (Presser, 2020) from the Pile."
Dataset Splits | No | While the paper mentions using a 'validation set of Books3 subset', it does not provide specific dataset split percentages or absolute sample counts for training, validation, and test sets.
Hardware Specification | Yes | "For fine-tuning 7B and 13B models, we use 32 A100 GPUs and adopt ZeRO3 strategies (Rajbhandari et al., 2020)." (a ZeRO-3 config sketch appears after the table)
Software Dependencies | No | The paper mentions using software like AdamW, CoLLiE, OpenCompass, and FlashAttention-2, but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | "For fine-tuning RoPE with different bases, we set the global batch size to 128, tuning the context length to 4K, the same as the training length, and the evaluating context length to 100K. We fine-tune the models for 1K steps using the next token prediction objective... We set the learning rate to 2 × 10⁻⁵ with no warmup. We set the max gradient norm to 2.5 for 7B and 1 for 13B respectively. We set the weight decay to zero." (these hyperparameters are collected in the config sketch after the table)
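
The rotary-base adjustment quoted under Research Type amounts to changing a single scalar in the RoPE frequency computation. Below is a minimal PyTorch sketch of interleaved RoPE with an adjustable base; the helper names (`rope_frequencies`, `apply_rope`) are illustrative and not taken from the authors' scaling-rope repository.

```python
import torch

def rope_frequencies(head_dim: int, seq_len: int, base: float = 10000.0):
    """Cos/sin tables for RoPE; `base` is the rotary base the paper varies."""
    # Per-pair rotation speed theta_i = base^(-2i/d): a larger base slows the
    # rotation of high-index dimensions, a smaller base speeds it up.
    inv_freq = 1.0 / base ** (torch.arange(0, head_dim, 2).float() / head_dim)
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)  # (seq_len, d/2)
    return angles.cos(), angles.sin()

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """Rotate adjacent channel pairs of x with shape (..., seq_len, head_dim)."""
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

# Example: raising the base above the LLaMA default of 10000 (here to 1e6) is
# one of the two directions of adjustment the paper studies.
cos, sin = rope_frequencies(head_dim=128, seq_len=4096, base=1_000_000.0)
q = torch.randn(1, 8, 4096, 128)  # (batch, heads, seq_len, head_dim)
q_rotated = apply_rope(q, cos, sin)
```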
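
The ZeRO3 strategy cited under Hardware Specification corresponds, in DeepSpeed terms, to ZeRO optimization stage 3, which partitions optimizer states, gradients, and parameters across the 32 GPUs. A minimal config sketch follows; only the stage is implied by the paper, and the other fields are assumptions shown for completeness.

```python
# Minimal DeepSpeed-style config enabling ZeRO stage 3. Only "stage": 3 follows
# from the paper's quote; the batch size matches the Experiment Setup row, and
# the precision setting is an assumption (the quote does not state it).
deepspeed_config = {
    "train_batch_size": 128,
    "zero_optimization": {"stage": 3},
    "bf16": {"enabled": True},
}
```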
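
For quick reference, here are the hyperparameters quoted under Experiment Setup gathered into one place. This is a hedged sketch as a plain dictionary; the key names are illustrative, not field names from the authors' training scripts.

```python
# Fine-tuning hyperparameters as quoted in the paper; key names are illustrative.
finetune_config = {
    "global_batch_size": 128,
    "train_context_length": 4096,        # 4K, same as the training length
    "eval_context_length": 100_000,      # 100K evaluation context
    "max_steps": 1000,                   # 1K steps of next-token prediction
    "learning_rate": 2e-5,
    "warmup_steps": 0,                   # no warmup
    "max_grad_norm": {"7B": 2.5, "13B": 1.0},
    "weight_decay": 0.0,
}
```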