Data Engineering for Scaling Language Models to 128K Context

Authors: Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, Hao Peng

ICML 2024

Reproducibility variables, results, and supporting LLM responses:
Research Type: Experimental. We study the continual pretraining recipe for scaling language models' context lengths to 128K, with a focus on data engineering. We investigate the quantity and quality of the data for continual pretraining: (1) for quantity, we show that 500 million to 5 billion tokens are enough to enable the model to retrieve information anywhere within the 128K context; (2) for quality, our results equally emphasize domain balance and length upsampling. Our recipe outperforms strong open-source long-context models and closes the gap to frontier models like GPT-4 128K.
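The recipe summarized above hinges on upsampling long documents while keeping the domain mixture fixed. A minimal sketch of that reweighting step is given below; it is a hypothetical illustration, not the authors' released code, and the function name upsample_long_docs, the 100,000-character length threshold, and the boost factor are all assumptions.

```python
import random
from collections import defaultdict

def upsample_long_docs(docs, long_threshold=100_000, boost=5.0, seed=0):
    """Hypothetical per-domain length upsampling.

    docs: list of dicts with keys "text" and "domain".
    Long documents receive a larger sampling weight, but sampling is done
    separately inside each domain, so the overall domain balance of the
    corpus is preserved while the long/short ratio within a domain shifts.
    """
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for doc in docs:
        by_domain[doc["domain"]].append(doc)

    sampled = []
    for domain, items in by_domain.items():
        weights = [boost if len(d["text"]) >= long_threshold else 1.0 for d in items]
        # Draw as many documents per domain as the original corpus held,
        # so each domain keeps its original share of the mixture.
        sampled.extend(rng.choices(items, weights=weights, k=len(items)))
    return sampled
```

Sampling with replacement inside each domain is one simple way to realize "equally emphasize domain balance and length upsampling"; the authors' actual pipeline may differ.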
Researcher Affiliation: Collaboration. 1 University of Edinburgh, 2 MIT-IBM Watson AI Lab, 3 University of Melbourne, 4 CMU, 5 University of Washington, 6 Massachusetts Institute of Technology, 7 University of Illinois at Urbana-Champaign.
Pseudocode: No. The paper describes its methods in prose and through experimental setups, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code: No. The paper cites external works and benchmarks that have associated code (e.g., GitHub links for the Needle-in-a-Haystack test), but the authors provide no explicit statement or link for open-sourcing the code behind their own method.
Open Datasets: Yes. We use the SlimPajama (Soboleva et al., 2023) dataset for continual pretraining. ... Soboleva, D., Al-Khateeb, F., Myers, R., Steeves, J. R., Hestness, J., and Dey, N. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama, 2023. URL https://huggingface.co/datasets/cerebras/SlimPajama-627B.
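The dataset is publicly hosted on the HuggingFace Hub, so it can be pulled directly. The snippet below is a minimal sketch assuming streaming access and the "train" split; the record fields ("text", "meta") reflect the public dataset card rather than anything stated in the paper.

```python
from datasets import load_dataset

# Stream SlimPajama from the HuggingFace Hub; the full corpus is ~627B tokens,
# so streaming avoids downloading everything up front.
slimpajama = load_dataset(
    "cerebras/SlimPajama-627B",
    split="train",      # assumed split name
    streaming=True,
)

for example in slimpajama.take(3):
    # Each record carries raw text plus metadata identifying the source domain
    # (CommonCrawl, C4, GitHub, Books, ArXiv, Wikipedia, StackExchange).
    print(example["meta"], example["text"][:200])
```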
Dataset Splits: No. The paper mentions using 'validation loss' and shows a 'Loss Comparison' in Figure 4, implying the use of a validation set. However, it does not provide specific details on how the dataset was split into training, validation, and test sets (e.g., exact percentages, sample counts, or references to predefined splits).
Hardware Specification: Yes. Our specific configuration is listed in Table 2. We note that this configuration is substantially cheaper than previous work (Xiong et al., 2023)... From Table 2: Hardware: 8x 80G A100. LLaMA-2 7B: Ctx. 4K, 3 days / 10B tokens; Ctx. 80K, 10 days / 10B tokens. LLaMA-2 13B: Ctx. 4K, 5 days / 10B tokens; Ctx. 64K, 13 days / 10B tokens. Hardware: 2x8 80G A100...
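For intuition, the throughput figures above can be combined with the 5B-token training budget quoted under Experiment Setup to estimate wall-clock time. The calculation below is a back-of-envelope sketch under the 7B / 80K-context rate, not a figure reported by the authors.

```python
# Rough wall-clock estimate from the quoted throughput.
tokens_trained = 5e9              # 5B-token continual pretraining budget
days_per_10b_tokens = 10          # LLaMA-2 7B at 80K context (Table 2 excerpt)
estimated_days = tokens_trained / 10e9 * days_per_10b_tokens
print(f"~{estimated_days:.1f} days on the quoted hardware")  # ~5.0 days
```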
Software Dependencies: No. The paper lists software components such as 'HuggingFace Transformers + DeepSpeed ZeRO-3 + FlashAttention-2 + Gradient Checkpointing + CPU Offloading' under 'Framework' in Table 2, but it does not specify version numbers for these software dependencies.
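The components named in the 'Framework' row are typically wired together along the lines of the sketch below. Only the component list and the 2e-5 constant learning rate come from the paper; the ZeRO-3 offloading settings, bf16 precision, micro-batch size, model checkpoint name, and the use of HuggingFace TrainingArguments are assumptions made for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

# DeepSpeed ZeRO-3 with optimizer and parameter offloading to CPU
# (assumed values; the paper names the components but not their settings).
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
    },
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": "auto",
}

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",               # illustrative base model
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # FlashAttention-2 kernels
)
model.gradient_checkpointing_enable()          # trade compute for memory

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    learning_rate=2e-5,              # constant LR quoted in the setup
    lr_scheduler_type="constant",
    deepspeed=ds_config,             # ZeRO-3 config passed to the HF Trainer
    bf16=True,
)
```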
Experiment Setup: Yes. For training, we use a constant learning rate of 2e-5. We modify the base of the RoPE positional encoding to adjust it to longer context, as in Xiong et al. (2023). We pack all data into 80K chunks regardless of document boundaries, following common practice (Raffel et al., 2020; Touvron et al., 2023a). We set the batch size to 4M tokens. ... We train the model on 5B tokens, which translates to 5B (size of data) / 4M (batch size) = 2000 optimization steps.
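The two mechanical pieces of this setup, raising the RoPE base and packing documents into fixed-length chunks, can be sketched as follows. The rope_theta and max_position_embeddings values and the pack_to_chunks helper are illustrative assumptions; the excerpt above says only that the base is modified as in Xiong et al. (2023) and does not give the exact number.

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Raise the RoPE base frequency so the rotary encoding remains informative at
# positions far beyond the original 4K window (placeholder value).
config = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-hf")
config.rope_theta = 5_000_000
config.max_position_embeddings = 131072   # 128K target context
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", config=config)

# Pack tokenized documents into fixed 80K-token chunks, ignoring document
# boundaries, as described in the quoted setup (hypothetical helper).
def pack_to_chunks(token_streams, chunk_len=80_000):
    buffer = []
    for tokens in token_streams:
        buffer.extend(tokens)
        while len(buffer) >= chunk_len:
            yield buffer[:chunk_len]
            buffer = buffer[chunk_len:]
```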