An Efficient Recipe for Long Context Extension via Middle-Focused Positional Encoding
Authors: Tong Wu, Yanpeng Zhao, Zilong Zheng
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that CREAM successfully extends LLMs to the target length for both Base and Chat versions of Llama 2-7B with "Never Miss A Beat". Our code is publicly available at https://github.com/bigai-nlco/cream. In Section 3, we conduct comprehensive experiments to demonstrate the efficiency and effectiveness of CREAM. We continually pre-trained Llama 2-7B with CREAM for a short period and extended the context window size from 4K up to 256K. |
| Researcher Affiliation | Academia | Tong Wu (wutong1@bigai.ai), Yanpeng Zhao (zhaoyanpeng@bigai.ai), Zilong Zheng (zlzheng@bigai.ai); State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China. Corresponding author. |
| Pseudocode | Yes | Algorithm 1 CREAM sampling algorithm |
| Open Source Code | Yes | Our code is publicly available at https://github.com/bigai-nlco/cream. |
| Open Datasets | Yes | For training the Base model, we directly utilize The Pile data provided by Zhu et al. [2023], and select samples with token lengths exceeding 4K. For training the Chat model, we filter the ShareGPT data from public datasets. |
| Dataset Splits | No | The paper states 'efficiently learning from a small-scale training data Dtrain with a maximum sequence length N' and uses '4K length data' for fine-tuning, but does not explicitly detail train/validation/test dataset splits by percentage or count. |
| Hardware Specification | Yes | We perform fine-tuning on two A100-80G GPUs with a total batch size of 32 and run inference on a single A100-80G GPU. |
| Software Dependencies | No | The paper mentions 'PyTorch', 'DeepSpeed', and 'FlashAttention-2' but does not specify their version numbers. |
| Experiment Setup | Yes | A learning rate of 2 × 10⁻⁵ with a linear scheduler is adopted, incorporating 10 warmup steps. We use the AdamW [Loshchilov and Hutter, 2018] optimizer with the hyperparameter configurations specified by PyTorch [Paszke et al., 2019]. ... For CREAM-Base, we fine-tune it for 1,000 steps on a dataset derived from the Pile [Gao et al., 2020]; for CREAM-Chat, we fine-tune it for 100 steps on ShareGPT [Zheng et al., 2024]. ... We utilize two A100-80G machines with a global batch size of 32. |
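
For illustration, a minimal sketch of the length-based filtering described in the Open Datasets row above (keeping Pile samples whose tokenized length exceeds 4K). The dataset file, tokenizer checkpoint, and the exact 4096-token threshold are assumptions made for this sketch; the authors' actual preprocessing lives in the released repository (https://github.com/bigai-nlco/cream).

```python
# Sketch only: length-based filtering of Pile samples (>4K tokens).
# File path and tokenizer are assumptions, not the authors' preprocessing script.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed base tokenizer
pile = load_dataset("json", data_files="pile_subset.jsonl", split="train")  # hypothetical local dump

def longer_than_4k(example):
    # Count tokens of the raw text; 4096 is an assumed reading of "exceeding 4K".
    ids = tokenizer(example["text"], add_special_tokens=False)["input_ids"]
    return len(ids) > 4096

long_samples = pile.filter(longer_than_4k)
print(f"Kept {len(long_samples)} samples longer than 4K tokens")
```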
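
Similarly, a hedged sketch of a fine-tuning configuration matching the hyperparameters quoted in the Experiment Setup row (learning rate 2 × 10⁻⁵, linear scheduler, 10 warmup steps, AdamW, 1,000 steps for CREAM-Base, global batch size 32 on two A100-80G GPUs). This is not the authors' training script; the per-device batch size, gradient accumulation, precision, and DeepSpeed config path are assumptions chosen only to reproduce the reported global batch size.

```python
# Sketch of the reported fine-tuning hyperparameters using Hugging Face TrainingArguments.
# Not the official CREAM script; see https://github.com/bigai-nlco/cream.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="cream-base-ft",        # hypothetical output path
    max_steps=1000,                    # 1,000 steps for CREAM-Base (100 for CREAM-Chat)
    learning_rate=2e-5,                # as reported in the paper
    lr_scheduler_type="linear",        # linear scheduler
    warmup_steps=10,                   # 10 warmup steps
    optim="adamw_torch",               # AdamW with PyTorch defaults
    per_device_train_batch_size=4,     # assumption: 4 x 2 GPUs x grad_accum 4 = global batch 32
    gradient_accumulation_steps=4,     # assumption (see above)
    bf16=True,                         # assumption: mixed precision on A100-80G
    deepspeed="ds_config.json",        # assumption: DeepSpeed config path (paper mentions DeepSpeed)
)
```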