LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens
Authors: Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, Mao Yang
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on LLaMA2 and Mistral across various tasks demonstrate the effectiveness of our method. Models extended via LongRoPE retain the original architecture with minor modifications to the positional embedding, and can reuse most pre-existing optimizations. Code is available at https://github.com/microsoft/LongRoPE |
| Researcher Affiliation | Collaboration | Microsoft Research; Hangzhou Dianzi University; University of Science and Technology of China. Yiran Ding and Yuanyuan Xu did this work during their internship at MSRA. Correspondence to: Li Lyna Zhang <lzhani@microsoft.com>. |
| Pseudocode | Yes | Algorithm 1 The search algorithm for effective non-uniform positional interpolation |
| Open Source Code | Yes | Code is available at https://github.com/microsoft/LongRoPE |
| Open Datasets | Yes | For LLaMA2, we use a learning rate of 2e-5 with linear decay and a global batch size of 32. We fine-tune for 400 steps on the RedPajama (Computer, 2023) dataset, chunked into 128k segments bookended with the BOS and EOS tokens. |
| Dataset Splits | Yes | The search is guided by perplexity, using 5 random samples from PG19 (Rae et al., 2019) validation set. |
| Hardware Specification | Yes | All our experiments are conducted on 16 A100 GPUs. |
| Software Dependencies | No | We employ FlashAttention-2 (Dao, 2023) to accelerate both training and inference. ... We reuse the data precision settings from the original Huggingface model checkpoints. |
| Experiment Setup | Yes | For LLaMA2, we use a learning rate of 2e-5 with linear decay and a global batch size of 32. We fine-tune for 400 steps on the RedPajama (Computer, 2023) dataset, chunked into 128k segments bookended with the BOS and EOS tokens. Then, based on the finished checkpoint, we train an additional 600 steps to achieve a 256k context window. The 128k context size is trained on 8 A100 GPUs with the distributed training system (Lin et al., 2023), while the 256k requires 16 A100 GPUs. In the case of Mistral, a constant learning rate of 1e-6 and a global batch size of 64 are used. |
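
The pseudocode row above refers to Algorithm 1, a search over non-uniform (per-dimension) rescale factors for the RoPE positional embedding. The sketch below is a minimal illustration, not the paper's released implementation, of how searched per-dimension factors could be applied when building a RoPE cache: `rescale_factors`, `n_hat`, and the function name are hypothetical placeholders standing in for values that LongRoPE's evolutionary search would produce; the authoritative code is in the linked repository.

```python
import torch

def build_nonuniform_rope_cache(seq_len: int,
                                head_dim: int,
                                rescale_factors: torch.Tensor,
                                n_hat: int,
                                base: float = 10000.0):
    """Illustrative sketch of non-uniform RoPE interpolation.

    rescale_factors: one scale per rotary dimension (head_dim // 2 values),
        assumed to come from a search like Algorithm 1 in the paper.
    n_hat: number of initial token positions left un-interpolated (assumption
        standing in for the searched threshold).
    """
    half_dim = head_dim // 2
    # Standard RoPE inverse frequencies, one per rotary dimension.
    inv_freq = 1.0 / (base ** (torch.arange(half_dim, dtype=torch.float32) / half_dim))
    positions = torch.arange(seq_len, dtype=torch.float32)

    # Non-uniform interpolation: each dimension gets its own rescale factor
    # instead of one uniform scale shared by all dimensions.
    scaled_inv_freq = inv_freq / rescale_factors          # (half_dim,)

    interpolated = torch.outer(positions, scaled_inv_freq)  # (seq_len, half_dim)
    original = torch.outer(positions, inv_freq)              # (seq_len, half_dim)

    # Keep the first n_hat positions un-interpolated; interpolate the rest.
    keep_original = (positions < n_hat).unsqueeze(-1)
    angles = torch.where(keep_original, original, interpolated)
    return angles.cos(), angles.sin()


# Hypothetical usage: head dim 128, placeholder factors for an 8x extension.
cos, sin = build_nonuniform_rope_cache(
    seq_len=32768,
    head_dim=128,
    rescale_factors=torch.linspace(1.0, 8.0, 64),  # placeholder for searched factors
    n_hat=128,                                      # placeholder for searched threshold
)
```

Because only the cos/sin cache changes, a model extended this way keeps its original architecture, which is consistent with the paper's claim that existing optimizations such as FlashAttention-2 can be reused unchanged.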