LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens
Authors: Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, Mao Yang
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on LLaMA2 and Mistral across various tasks demonstrate the effectiveness of our method. Models extended via LongRoPE retain the original architecture with minor modifications to the positional embedding, and can reuse most pre-existing optimizations. Code is available at https://github.com/microsoft/LongRoPE |
| Researcher Affiliation | Collaboration | Microsoft Research; Hangzhou Dianzi University; University of Science and Technology of China. Yiran Ding and Yuanyuan Xu did this work during their internship at MSRA. Correspondence to: Li Lyna Zhang <lzhani@microsoft.com>. |
| Pseudocode | Yes | Algorithm 1 The search algorithm for effective non-uniform positional interpolation |
| Open Source Code | Yes | Code is available at https://github.com/microsoft/LongRoPE |
| Open Datasets | Yes | For LLaMA2, we use a learning rate of 2e-5 with linear decay and a global batch size of 32. We fine-tune for 400 steps on the RedPajama (Computer, 2023) dataset, chunked into 128k segments bookended with the BOS and EOS tokens. |
| Dataset Splits | Yes | The search is guided by perplexity, using 5 random samples from PG19 (Rae et al., 2019) validation set. |
| Hardware Specification | Yes | All our experiments are conducted on 16 A100 GPUs. |
| Software Dependencies | No | We employ FlashAttention-2 (Dao, 2023) to accelerate both training and inference. ... We reuse the data precision settings from the original Huggingface model checkpoints. |
| Experiment Setup | Yes | For LLaMA2, we use a learning rate of 2e-5 with linear decay and a global batch size of 32. We fine-tune for 400 steps on the RedPajama (Computer, 2023) dataset, chunked into 128k segments bookended with the BOS and EOS tokens. Then, based on the finished checkpoint, we train an additional 600 steps to achieve a 256k context window. The 128k context size is trained on 8 A100 GPUs with the distributed training system (Lin et al., 2023), while the 256k requires 16 A100 GPUs. In the case of Mistral, a constant learning rate of 1e-6 and a global batch size of 64 are used. |
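
The pseudocode row above refers to Algorithm 1, a search over non-uniform (per-dimension) rescale factors for the RoPE positional embedding. The sketch below is a minimal illustration, not the paper's released implementation, of how searched per-dimension factors could be applied when building a RoPE cache: `rescale_factors`, `n_hat`, and the function name are hypothetical placeholders standing in for values that LongRoPE's evolutionary search would produce; the authoritative code is in the linked repository.

```python
import torch

def build_nonuniform_rope_cache(seq_len: int,
                                head_dim: int,
                                rescale_factors: torch.Tensor,
                                n_hat: int,
                                base: float = 10000.0):
    """Illustrative sketch of non-uniform RoPE interpolation.

    rescale_factors: one scale per rotary dimension (head_dim // 2 values),
        assumed to come from a search like Algorithm 1 in the paper.
    n_hat: number of initial token positions left un-interpolated (assumption
        standing in for the searched threshold).
    """
    half_dim = head_dim // 2
    # Standard RoPE inverse frequencies, one per rotary dimension.
    inv_freq = 1.0 / (base ** (torch.arange(half_dim, dtype=torch.float32) / half_dim))
    positions = torch.arange(seq_len, dtype=torch.float32)

    # Non-uniform interpolation: each dimension gets its own rescale factor
    # instead of one uniform scale shared by all dimensions.
    scaled_inv_freq = inv_freq / rescale_factors          # (half_dim,)

    interpolated = torch.outer(positions, scaled_inv_freq)  # (seq_len, half_dim)
    original = torch.outer(positions, inv_freq)              # (seq_len, half_dim)

    # Keep the first n_hat positions un-interpolated; interpolate the rest.
    keep_original = (positions < n_hat).unsqueeze(-1)
    angles = torch.where(keep_original, original, interpolated)
    return angles.cos(), angles.sin()


# Hypothetical usage: head dim 128, placeholder factors for an 8x extension.
cos, sin = build_nonuniform_rope_cache(
    seq_len=32768,
    head_dim=128,
    rescale_factors=torch.linspace(1.0, 8.0, 64),  # placeholder for searched factors
    n_hat=128,                                      # placeholder for searched threshold
)
```

Because only the cos/sin cache changes, a model extended this way keeps its original architecture, which is consistent with the paper's claim that existing optimizations such as FlashAttention-2 can be reused unchanged.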