DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining
Authors: Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy Liang, Quoc V. Le, Tengyu Ma, Adams Wei Yu
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we use DoReMi on a 280M-parameter proxy model to set the domain weights for training an 8B-parameter model (30x larger) more efficiently. On The Pile, DoReMi improves perplexity across all domains, even when it downweights a domain. DoReMi improves average few-shot downstream accuracy by 6.5% points over a baseline model trained using The Pile's default domain weights and reaches the baseline accuracy with 2.6x fewer training steps. |
| Researcher Affiliation | Collaboration | Sang Michael Xie1,2, Hieu Pham1, Xuanyi Dong1, Nan Du1, Hanxiao Liu1, Yifeng Lu1, Percy Liang2, Quoc V. Le1, Tengyu Ma2, and Adams Wei Yu1. 1Google DeepMind; 2Stanford University |
| Pseudocode | Yes | Algorithm 1: DoReMi domain reweighting (Step 2). A sketch of this update appears after the table. |
| Open Source Code | Yes | A public re-implementation of DoReMi and optimized domain weights for The Pile can be found at https://github.com/sangmichaelxie/doremi. |
| Open Datasets | Yes | The Pile [17], a large publicly available dataset, is composed of 24% web data, 9% Wikipedia, 4% GitHub, etc. The GLaM dataset [13] (also used in training PaLM [10]) includes text from 8 domains (Table 2). |
| Dataset Splits | No | We use held-out validation data to measure the perplexity on each domain. |
| Hardware Specification | Yes | Models under 1B parameters were trained with TPUv3 accelerators, while 1B and 8B models were trained with TPUv4. |
| Software Dependencies | No | Finally, we update the proxy model for the objective L(θ_{t-1}, α_t) using a standard optimizer such as Adam [26] or Adafactor [46]. All experiments in this paper use Adafactor. We train Transformer [51] decoder-only LMs with the standard next-token language modeling loss. |
| Experiment Setup | Yes | All models use a batch size of 512 and a maximum token length of 1024. The proxy and reference models have 280M parameters. For all training runs (including DRO runs), we train with a batch size of 512, an initial learning rate of 1e-3, weight decay of 1e-2, and gradient clipping to norm 1. We decay the learning rate exponentially until it reaches a minimum of 1e-4 at the end of training, with a linear warmup over 6% of the total training steps. We train for 200k steps on The Pile and 300k steps on the GLaM dataset. A sketch of this learning-rate schedule appears after the table. |
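
The paper's Algorithm 1 updates the domain weights by exponentiated gradient ascent on the per-domain excess loss of the proxy model over the reference model. Below is a minimal NumPy sketch of one such update, assuming the clipped per-domain excess losses for the current minibatch have already been computed; the function name and the default values of the step size `eta` and smoothing parameter `c` are illustrative, not taken from the released code.

```python
import numpy as np

def update_domain_weights(alpha, excess_loss, eta=1.0, c=1e-3):
    """One domain-weight update in the style of DoReMi's Algorithm 1 (Step 2).

    alpha:       current domain weights, shape (k,), nonnegative, sums to 1
    excess_loss: per-domain clipped excess loss max(proxy_loss - ref_loss, 0),
                 averaged over that domain's tokens in the minibatch, shape (k,)
    eta:         exponentiated-gradient step size (illustrative default)
    c:           smoothing toward the uniform distribution (illustrative default)
    """
    k = alpha.shape[0]
    # Multiplicative update driven by the excess losses, then renormalize.
    alpha = alpha * np.exp(eta * excess_loss)
    alpha = alpha / alpha.sum()
    # Mix in the uniform distribution so no domain's weight collapses to zero.
    return (1 - c) * alpha + c / k
```

Per the paper, the proxy model is then trained on the reweighted objective L(θ, α_t) (the sum of per-domain losses weighted by α_t) with Adafactor, and the final DoReMi domain weights are the average of α_t over all training steps.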
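The learning-rate schedule quoted in the experiment setup (linear warmup over 6% of steps, then exponential decay from 1e-3 to 1e-4 at the end of training) can be written as a short helper. This is a sketch of one plausible reading of that description, not code from the paper; the function name and argument defaults are illustrative.

```python
def learning_rate(step, total_steps=200_000, peak_lr=1e-3,
                  min_lr=1e-4, warmup_frac=0.06):
    """Linear warmup to peak_lr over the first 6% of steps, then exponential
    decay that reaches min_lr at the final step."""
    warmup_steps = max(int(warmup_frac * total_steps), 1)
    if step < warmup_steps:
        # Linear warmup from 0 toward the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    # Geometric interpolation from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return peak_lr * (min_lr / peak_lr) ** progress
```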