An Efficient Recipe for Long Context Extension via Middle-Focused Positional Encoding
Authors: Tong Wu, Yanpeng Zhao, Zilong Zheng
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that CREAM successfully extends LLMs to the target length for both Base and Chat versions of Llama 2-7B with "Never Miss A Beat". Our code is publicly available at https://github.com/bigai-nlco/cream. In Section 3, we conduct comprehensive experiments to demonstrate the efficiency and effectiveness of CREAM. We continually pre-trained Llama 2-7B with CREAM for a short period and extended the context window size from 4K up to 256K. |
| Researcher Affiliation | Academia | Tong Wu (wutong1@bigai.ai), Yanpeng Zhao (zhaoyanpeng@bigai.ai), Zilong Zheng (zlzheng@bigai.ai); State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China. Corresponding author. |
| Pseudocode | Yes | Algorithm 1 CREAM sampling algorithm |
| Open Source Code | Yes | Our code is publicly available at https://github.com/bigai-nlco/cream. |
| Open Datasets | Yes | For training the Base model, we directly utilize The Pile data provided by Zhu et al. [2023], and select samples with token lengths exceeding 4K. For training the Chat model, we filter the ShareGPT data from public datasets. |
| Dataset Splits | No | The paper states 'efficiently learning from a small-scale training data Dtrain with a maximum sequence length N' and uses '4K length data' for fine-tuning, but does not explicitly detail train/validation/test dataset splits by percentage or count. |
| Hardware Specification | Yes | We perform fine-tuning on two A100-80G GPUs with a total batch size of 32 and run inference on a single A100-80G GPU. |
| Software Dependencies | No | The paper mentions 'PyTorch', 'DeepSpeed', and 'FlashAttention-2' but does not specify their version numbers. |
| Experiment Setup | Yes | A learning rate of 2 × 10⁻⁵ with a linear scheduler is adopted, incorporating 10 warmup steps. We use the AdamW [Loshchilov and Hutter, 2018] optimizer with the hyperparameter configurations specified by PyTorch [Paszke et al., 2019]. ... For CREAM-Base, we fine-tune it for 1,000 steps on a dataset derived from the Pile [Gao et al., 2020]; for CREAM-Chat, we fine-tune it for 100 steps on ShareGPT [Zheng et al., 2024]. ... We utilize two A100-80G machines with a global batch size of 32. |
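
For illustration, a minimal sketch of the length-based filtering described in the Open Datasets row above (keeping Pile samples whose tokenized length exceeds 4K). The dataset file, tokenizer checkpoint, and the exact 4096-token threshold are assumptions made for this sketch; the authors' actual preprocessing lives in the released repository (https://github.com/bigai-nlco/cream).

```python
# Sketch only: length-based filtering of Pile samples (>4K tokens).
# File path and tokenizer are assumptions, not the authors' preprocessing script.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed base tokenizer
pile = load_dataset("json", data_files="pile_subset.jsonl", split="train")  # hypothetical local dump

def longer_than_4k(example):
    # Count tokens of the raw text; 4096 is an assumed reading of "exceeding 4K".
    ids = tokenizer(example["text"], add_special_tokens=False)["input_ids"]
    return len(ids) > 4096

long_samples = pile.filter(longer_than_4k)
print(f"Kept {len(long_samples)} samples longer than 4K tokens")
```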
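
Similarly, a hedged sketch of a fine-tuning configuration matching the hyperparameters quoted in the Experiment Setup row (learning rate 2 × 10⁻⁵, linear scheduler, 10 warmup steps, AdamW, 1,000 steps for CREAM-Base, global batch size 32 on two A100-80G GPUs). This is not the authors' training script; the per-device batch size, gradient accumulation, precision, and DeepSpeed config path are assumptions chosen only to reproduce the reported global batch size.

```python
# Sketch of the reported fine-tuning hyperparameters using Hugging Face TrainingArguments.
# Not the official CREAM script; see https://github.com/bigai-nlco/cream.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="cream-base-ft",        # hypothetical output path
    max_steps=1000,                    # 1,000 steps for CREAM-Base (100 for CREAM-Chat)
    learning_rate=2e-5,                # as reported in the paper
    lr_scheduler_type="linear",        # linear scheduler
    warmup_steps=10,                   # 10 warmup steps
    optim="adamw_torch",               # AdamW with PyTorch defaults
    per_device_train_batch_size=4,     # assumption: 4 x 2 GPUs x grad_accum 4 = global batch 32
    gradient_accumulation_steps=4,     # assumption (see above)
    bf16=True,                         # assumption: mixed precision on A100-80G
    deepspeed="ds_config.json",        # assumption: DeepSpeed config path (paper mentions DeepSpeed)
)
```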