Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Simple and Scalable Strategies to Continually Pre-train Large Language Models
Authors: Adam Ibrahim, Benjamin Thérien, Kshitij Gupta, Mats Leon Richter, Quentin Gregory Anthony, Eugene Belilovsky, Timothée Lesort, Irina Rish
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a large-scale empirical study of continual learning techniques for LLM pre-training. Our empirical evaluation spans large (10B parameters) and small (405M parameters) decoder-only transformer models as well as weak (English→English) and stronger (English→German) distribution shifts. Our main contributions can be summarized as follows: 1. We establish the effect of learning rate re-warming and re-decaying for decoder-only transformer-based LLMs pre-trained using a cosine schedule... 2. We establish the effect of replaying previous data... 3. We demonstrate, across two model sizes and distribution shifts, that a simple and scalable combination of LR re-warming, LR re-decaying, and compute-equivalent replay allows continually pre-trained decoder-only transformer-based LLMs to attain similar performance on average to models re-trained on the union of all data while using significantly less compute. |
| Researcher Affiliation | Academia | Adam Ibrahim EMAIL Benjamin Thérien EMAIL Kshitij Gupta EMAIL Mats L. Richter EMAIL Quentin Anthony EMAIL Timothée Lesort EMAIL Eugene Belilovsky EMAIL Irina Rish EMAIL Department of Computer Science and Operations Research, Université de Montréal, Montréal, Canada; Department of Computer Science and Software Engineering, Concordia University, Montréal, Canada; Mila, Montréal, Canada; EleutherAI |
| Pseudocode | No | The paper includes mathematical equations for learning rate schedules in Sections 4.1 and 7.2, which describe the learning rate as a function of the training step. However, these are not presented as structured pseudocode or algorithm blocks with the numbered steps typical of algorithm descriptions. |
| Open Source Code | Yes | Our code is available at https://github.com/EleutherAI/gpt-neox through pull requests 1194 and 1200. Model checkpoints throughout continual pre-training for most of our models are available at https://huggingface.co/collections/cerc-aai/continual-pre-training-661f4af4379b82d9617a9401. |
| Open Datasets | Yes | We use three datasets for training and validation: SlimPajama (Soboleva et al., 2023), German Common Crawl (Laippala et al., 2022), and Pile (Gao et al., 2020). |
| Dataset Splits | Yes | To create our training set for SlimPajama, we randomly sub-sample the dataset (606B total tokens) to form a 299B token subset (see Table 11) that is of comparable size to Pile. We also further sub-sample this SlimPajama subset to create three 100B token splits of the dataset (see Sec. 7.4 for details). ... To create the German training and validation sets, we split and tokenized the German Common Crawl scrape, available as part of the Oscar dataset (Laippala et al., 2022), into a 195.43B token training set and a 982.6M token validation set. The Pile dataset comes pre-shuffled and mixed; we simply used the default training and validation sets. The training set is 330B tokens total, though in our experiments we only train on a 300B token subset. |
| Hardware Specification | Yes | This research was made possible thanks to the computing resources on the Summit supercomputer, provided as a part of the INCITE 2023 program award Scalable Foundation Models for Transferable Generalist AI. These resources were provided by the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. In particular, we thank Jens Glaser for his help with the Summit supercomputer. |
| Software Dependencies | No | The paper mentions using 'GPT-NeoX (Andonian et al., 2021) based on Megatron-DeepSpeed (Shoeybi et al., 2019; Microsoft, 2020)' and the 'AdamW optimizer (Loshchilov & Hutter, 2019)', but does not provide specific version numbers for these software components or other libraries/languages used. |
| Experiment Setup | Yes | For all models, we train with the AdamW optimizer (Loshchilov & Hutter, 2019) using a batch size of 1104 and a sequence length of 2048. An epoch of training approximately corresponds to 132,366 total training steps. As mentioned in the previous section, we reset the optimizer states between datasets. We consider two model sizes: 405M and 9.6B parameters (referred to as 10B in this work), including embeddings. ... We provide an extended description of all hyperparameters in the appendix (Table 13). |
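The paper's central recipe is learning rate re-warming and re-decaying: when continual pre-training begins on a new dataset, the cosine schedule is restarted rather than continued from its decayed endpoint. The sketch below illustrates that idea only; it is not the authors' implementation (which lives in the GPT-NeoX pull requests cited above), and the function names, phase-boundary handling, and default values are illustrative assumptions.

```python
import math

def cosine_lr(step, warmup_steps, total_steps, max_lr, min_lr):
    """Linear warmup to max_lr, then cosine decay to min_lr (a standard cosine schedule)."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def continual_lr(global_step, phase_boundary, warmup_steps, phase_steps, max_lr, min_lr):
    """Re-warm and re-decay: once training crosses into the new dataset
    (global_step >= phase_boundary), restart the cosine schedule from step 0,
    so the LR climbs back to max_lr and decays afresh over the new phase.
    phase_boundary/phase_steps are hypothetical knobs for this sketch."""
    if global_step < phase_boundary:
        step_in_phase = global_step
    else:
        step_in_phase = global_step - phase_boundary
    return cosine_lr(step_in_phase, warmup_steps, phase_steps, max_lr, min_lr)
```

At the phase boundary the returned LR jumps from roughly `min_lr` back toward `max_lr`, which is exactly the "re-warming" the paper studies; the authors pair this with compute-equivalent replay of earlier data to offset the forgetting that the re-warmed LR would otherwise cause.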