DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining

Authors: Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy S. Liang, Quoc V. Le, Tengyu Ma, Adams Wei Yu

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we use DoReMi on a 280M-parameter proxy model to set the domain weights for training an 8B-parameter model (30x larger) more efficiently. On The Pile, DoReMi improves perplexity across all domains, even when it downweights a domain. DoReMi improves average few-shot downstream accuracy by 6.5% points over a baseline model trained using The Pile's default domain weights and reaches the baseline accuracy with 2.6x fewer training steps.
Researcher Affiliation | Collaboration | Sang Michael Xie (1,2), Hieu Pham (1), Xuanyi Dong (1), Nan Du (1), Hanxiao Liu (1), Yifeng Lu (1), Percy Liang (2), Quoc V. Le (1), Tengyu Ma (2), and Adams Wei Yu (1); 1: Google DeepMind, 2: Stanford University
Pseudocode | Yes | Algorithm 1: DoReMi domain reweighting (Step 2). A minimal sketch of this reweighting step follows the table.
Open Source Code | Yes | A public re-implementation of DoReMi and optimized domain weights for The Pile can be found at https://github.com/sangmichaelxie/doremi.
Open Datasets | Yes | The Pile [17], a large publicly available dataset, is composed of 24% web data, 9% Wikipedia, 4% GitHub, etc. The GLaM dataset [13] (also used in training PaLM [10]) includes text from 8 domains (Table 2).
Dataset Splits | No | We use held-out validation data to measure the perplexity on each domain.
Hardware Specification | Yes | Models under 1B parameters were trained with TPUv3 accelerators, while 1B and 8B models were trained with TPUv4.
Software Dependencies | No | Finally, we update the proxy model for the objective L(θ_{t-1}, α_t) using a standard optimizer such as Adam [26] or Adafactor [46]. All experiments in this paper use Adafactor. We train Transformer [51] decoder-only LMs with the standard next-token language modeling loss.
Experiment Setup | Yes | All models use a batch size of 512 and a maximum token length of 1024. The proxy and reference models have 280M parameters. For all training runs (including DRO runs), we train with a batch size of 512, an initial learning rate of 1e-3, weight decay of 1e-2, and gradient clipping to norm 1. We decay the learning rate exponentially until it reaches a minimum of 1e-4 at the end of training, with a linear warmup over 6% of the total training steps. We train for 200k steps on The Pile and 300k steps on the GLaM dataset. The learning-rate schedule is sketched after the table.
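
The Algorithm 1 row above refers to DoReMi's Step 2, in which a small proxy model is trained with a Group DRO-style objective and the domain weights are updated from the proxy's excess loss over a reference model. The sketch below shows one such domain-weight update, assuming per-domain average losses are already computed; the function name `update_domain_weights` and the values of the step size `eta` and smoothing parameter `c` are illustrative, not the paper's exact settings.

```python
import numpy as np

def update_domain_weights(alpha, proxy_losses, ref_losses, eta=1.0, c=1e-3):
    """One DoReMi-style domain-weight update (illustrative sketch).

    alpha:        current domain weights, shape (k,), sums to 1
    proxy_losses: per-domain average loss of the proxy model, shape (k,)
    ref_losses:   per-domain average loss of the reference model, shape (k,)
    eta:          exponentiated-gradient step size (assumed value)
    c:            smoothing toward the uniform distribution (assumed value)
    """
    # Excess loss: how much worse the proxy is than the reference on each domain,
    # clipped at zero so domains the proxy already fits well are not upweighted.
    excess = np.maximum(proxy_losses - ref_losses, 0.0)

    # Exponentiated-gradient ascent on the domain weights (in log space for stability).
    log_alpha = np.log(alpha) + eta * excess
    alpha = np.exp(log_alpha - log_alpha.max())
    alpha /= alpha.sum()

    # Mix with the uniform distribution so no domain's weight collapses to zero.
    k = len(alpha)
    return (1.0 - c) * alpha + c * np.ones(k) / k


# Example: three domains, with the proxy lagging the reference on domain 0.
alpha = np.ones(3) / 3
alpha = update_domain_weights(alpha,
                              np.array([3.2, 2.1, 2.5]),   # proxy losses
                              np.array([2.8, 2.1, 2.6]))   # reference losses
print(alpha)  # domain 0 receives more weight than the others
```

In the paper's procedure, the proxy model's training loss at each step is then reweighted by the updated domain weights, and the final DoReMi domain weights are the per-step weights averaged over training.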
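
The learning-rate schedule described in the Experiment Setup row (linear warmup over 6% of training, then exponential decay from 1e-3 down to 1e-4 at the final step) can be written as a short helper. This is a sketch built only from the numbers quoted above, not the authors' training code; the function name `learning_rate` is ours.

```python
def learning_rate(step, total_steps, peak_lr=1e-3, min_lr=1e-4, warmup_frac=0.06):
    """Linear warmup followed by exponential decay to min_lr at total_steps (sketch)."""
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        # Linear warmup from 0 up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    # Exponential decay chosen so the rate hits min_lr exactly at the last step.
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return peak_lr * (min_lr / peak_lr) ** progress


# Example for the 200k-step runs on The Pile.
print(learning_rate(0, 200_000))        # ~8.3e-8, start of warmup
print(learning_rate(12_000, 200_000))   # 1e-3, end of warmup
print(learning_rate(200_000, 200_000))  # 1e-4, end of training
```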