An Efficient Recipe for Long Context Extension via Middle-Focused Positional Encoding

Authors: Tong Wu, Yanpeng Zhao, Zilong Zheng

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that CREAM successfully extends LLMs to the target length for both the Base and Chat versions of Llama 2-7B with "Never Miss A Beat". Our code is publicly available at https://github.com/bigai-nlco/cream. In Section 3, we conduct comprehensive experiments to demonstrate the efficiency and effectiveness of CREAM. We continually pre-train Llama 2-7B with CREAM for a short period and extend the context window size from 4K up to 256K.
Researcher Affiliation | Academia | Tong Wu (wutong1@bigai.ai), Yanpeng Zhao (zhaoyanpeng@bigai.ai), Zilong Zheng (zlzheng@bigai.ai, corresponding author); State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China.
Pseudocode | Yes | Algorithm 1: CREAM sampling algorithm (an illustrative sketch follows the table).
Open Source Code | Yes | Our code is publicly available at https://github.com/bigai-nlco/cream.
Open Datasets | Yes | For training the Base model, we directly use The Pile data provided by Zhu et al. [2023] and select samples with token lengths exceeding 4K. For training the Chat model, we filter the ShareGPT data from public datasets (a length-filtering sketch follows the table).
Dataset Splits | No | The paper states 'efficiently learning from a small-scale training data D_train with a maximum sequence length N' and uses '4K length data' for fine-tuning, but does not explicitly detail train/validation/test splits by percentage or count.
Hardware Specification | Yes | We perform fine-tuning on two A100-80G GPUs with a total batch size of 32 and run inference on a single A100-80G GPU.
Software Dependencies | No | The paper mentions PyTorch, DeepSpeed, and FlashAttention-2 but does not specify their version numbers.
Experiment Setup | Yes | A learning rate of 2 × 10⁻⁵ with a linear scheduler is adopted, incorporating 10 warmup steps. We use the AdamW optimizer (Loshchilov and Hutter [2018]) with the hyperparameter configurations specified by PyTorch (Paszke et al. [2019]). ... For CREAM-Base, we fine-tune for 1,000 steps on a dataset derived from the Pile (Gao et al. [2020]); for CREAM-Chat, we fine-tune for 100 steps on ShareGPT (Zheng et al. [2024]). ... We use two A100-80G machines with a global batch size of 32 (an optimizer/scheduler sketch follows the table).
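The Pseudocode row above only confirms that the paper provides "Algorithm 1: CREAM sampling algorithm"; the algorithm itself is not reproduced here. The sketch below is therefore an illustration of the general middle-focused idea (keep head and tail positions contiguous, bias the middle block toward the interior of the extended position range with a truncated Gaussian), not the authors' Algorithm 1. All function and parameter names are assumptions.

```python
# Illustrative sketch only -- NOT the paper's Algorithm 1. It shows one way a
# middle-focused positional-index sampler could look: the first and last `edge`
# positions keep contiguous indices at the ends of the extended range, while the
# middle chunk is mapped to a block whose start is drawn from a truncated
# Gaussian centred on the middle of the extended range.
import numpy as np


def sample_middle_focused_positions(
    train_len=4096,      # fine-tuning window (4K in the paper)
    target_len=262144,   # desired extended context (up to 256K)
    edge=512,            # head/tail lengths kept contiguous (assumed value)
    sigma_frac=0.25,     # Gaussian width as a fraction of the free range (assumed value)
    rng=None,
):
    """Return `train_len` position indices drawn from [0, target_len)."""
    rng = rng or np.random.default_rng()
    mid_len = train_len - 2 * edge

    # Head keeps the first `edge` indices, tail keeps the last `edge` indices.
    head = np.arange(edge)
    tail = np.arange(target_len - edge, target_len)

    # The middle block's start is sampled from a Gaussian centred on the midpoint
    # of the remaining range and clipped to stay in bounds, so training sees
    # middle positions more often than a uniform sampler would.
    lo, hi = edge, target_len - edge - mid_len
    mu, sigma = (lo + hi) / 2, sigma_frac * (hi - lo)
    start = int(np.clip(rng.normal(mu, sigma), lo, hi))
    middle = np.arange(start, start + mid_len)

    return np.concatenate([head, middle, tail])


if __name__ == "__main__":
    pos = sample_middle_focused_positions(rng=np.random.default_rng(0))
    print(pos[:3], pos[2046:2050], pos[-3:])  # head, middle, and tail indices
```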
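The Open Datasets row states that Base-model training data consists of Pile samples whose token length exceeds 4K. A minimal sketch of such a filter is shown below, assuming the Hugging Face `datasets` and `transformers` libraries; the dataset name, tokenizer checkpoint, and column names are assumptions for illustration, not the authors' actual preprocessing script.

```python
# Hedged sketch of the length filter described in the "Open Datasets" row:
# keep only samples whose tokenized length exceeds 4K tokens.
from datasets import load_dataset
from transformers import AutoTokenizer

MIN_TOKENS = 4096
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed checkpoint

# Any Pile-style corpus with a "text" column would work here (assumed source).
raw = load_dataset("monology/pile-uncopyrighted", split="train", streaming=True)


def long_enough(example):
    # Count tokens without adding special-token overhead to the decision.
    n_tokens = len(tokenizer(example["text"], add_special_tokens=False)["input_ids"])
    return n_tokens > MIN_TOKENS


long_samples = raw.filter(long_enough)
```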
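The Experiment Setup row reports AdamW with PyTorch's default hyperparameters, a learning rate of 2 × 10⁻⁵, a linear scheduler with 10 warmup steps, and 1,000 (Base) or 100 (Chat) fine-tuning steps at a global batch size of 32. The sketch below shows one way to wire that up with PyTorch and the Hugging Face `get_linear_schedule_with_warmup` helper; the model and training loop are placeholders, and the DeepSpeed/FlashAttention-2 integration mentioned in the paper is omitted.

```python
# Minimal sketch of the reported optimizer/scheduler configuration.
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(8, 8)  # placeholder for the Llama 2-7B model

total_steps = 1_000            # 1,000 steps for CREAM-Base; 100 for CREAM-Chat
optimizer = AdamW(model.parameters(), lr=2e-5)  # PyTorch default betas/eps/weight decay
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=10,       # 10 warmup steps, as reported
    num_training_steps=total_steps,
)

for step in range(total_steps):
    # ... forward/backward on a batch (global batch size 32 across 2 GPUs) ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```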