Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
An Efficient Recipe for Long Context Extension via Middle-Focused Positional Encoding
Authors: Tong Wu, Yanpeng Zhao, Zilong Zheng
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that CREAM successfully extends LLMs to the target length for both Base and Chat versions of Llama2-7B with "Never Miss A Beat". Our code is publicly available at https://github.com/bigai-nlco/cream. In Section 3, we conduct comprehensive experiments to demonstrate the efficiency and effectiveness of CREAM. We continually pre-trained on Llama 2-7B with CREAM for a short period and extend the context window size from 4K to up to 256K. |
| Researcher Affiliation | Academia | Tong Wu EMAIL Yanpeng Zhao EMAIL Zilong Zheng EMAIL State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China Corresponding author. |
| Pseudocode | Yes | Algorithm 1 CREAM sampling algorithm |
| Open Source Code | Yes | Our code is publicly available at https://github.com/bigai-nlco/cream. |
| Open Datasets | Yes | For training the Base model, we directly utilize The Pile data provided by Zhu et al. [2023], and select samples with token lengths exceeding 4K. For training the Chat model, we filter the ShareGPT data from public datasets. |
| Dataset Splits | No | The paper states 'efficiently learning from a small-scale training data $D_{\text{train}}$ with a maximum sequence length N' and uses '4K length data' for fine-tuning, but does not explicitly detail train/validation/test dataset splits by percentage or count. |
| Hardware Specification | Yes | We perform fine-tuning on two A100-80G GPUs with a total batch size of 32 and run inference on a single A100-80G GPU. |
| Software Dependencies | No | The paper mentions 'PyTorch', 'DeepSpeed', and 'FlashAttention-2' but does not specify their version numbers. |
| Experiment Setup | Yes | A learning rate of $2 \times 10^{-5}$ with a linear scheduler is adopted, incorporating 10 warmup steps. We use the AdamW (Loshchilov and Hutter [2018]) optimizer with the hyperparameter configurations specified by PyTorch (Paszke et al. [2019]). ... For CREAM-Base, we fine-tune it for 1,000 steps on a dataset derived from Pile (Gao et al. [2020]); for CREAM-Chat, we fine-tune it for 100 steps on ShareGPT (Zheng et al. [2024]). ... We utilize two A100-80G machines with a global batch size of 32 |
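
The quoted setup (peak learning rate of 2e-5, linear scheduler, 10 warmup steps, 1,000 steps for CREAM-Base, global batch size of 32 on two GPUs) can be illustrated with a minimal sketch. This is not the authors' training code. The function names `linear_lr` and `per_device_samples` are hypothetical, and the decay-to-zero shape after warmup is an assumption (it matches the common behavior of Hugging Face's `get_linear_schedule_with_warmup`), since the quoted text says only "linear scheduler".

```python
def linear_lr(step: int,
              base_lr: float = 2e-5,
              warmup_steps: int = 10,
              total_steps: int = 1000) -> float:
    """Learning rate at a given optimizer step.

    Assumption: linear warmup from 0 to base_lr over `warmup_steps`,
    then linear decay to 0 at `total_steps`. The source only states
    "linear scheduler" with 10 warmup steps.
    """
    if step < warmup_steps:
        # Warmup phase: ramp up proportionally to the step index.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: shrink proportionally to the steps that remain.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


def per_device_samples(global_batch: int = 32, num_gpus: int = 2) -> int:
    """Samples each GPU must contribute per optimizer step.

    How this is divided between micro-batch size and gradient
    accumulation is not stated in the source.
    """
    if global_batch % num_gpus != 0:
        raise ValueError("global batch must divide evenly across GPUs")
    return global_batch // num_gpus


if __name__ == "__main__":
    for step in (0, 5, 10, 505, 1000):
        print(step, linear_lr(step))
    print("samples per GPU per step:", per_device_samples())
```

For CREAM-Chat, the same sketch would apply with `total_steps=100`, again under the assumed decay shape.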