DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining

Authors: Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy S. Liang, Quoc V. Le, Tengyu Ma, Adams Wei Yu

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we use DoReMi on a 280M-parameter proxy model to set the domain weights for training an 8B-parameter model (30x larger) more efficiently. On The Pile, DoReMi improves perplexity across all domains, even when it downweights a domain. DoReMi improves average few-shot downstream accuracy by 6.5% points over a baseline model trained using The Pile's default domain weights and reaches the baseline accuracy with 2.6x fewer training steps.
Researcher Affiliation | Collaboration | Sang Michael Xie (1,2), Hieu Pham (1), Xuanyi Dong (1), Nan Du (1), Hanxiao Liu (1), Yifeng Lu (1), Percy Liang (2), Quoc V. Le (1), Tengyu Ma (2), and Adams Wei Yu (1); 1: Google DeepMind, 2: Stanford University
Pseudocode | Yes | Algorithm 1: DoReMi domain reweighting (Step 2). A minimal sketch of this reweighting step follows the table.
Open Source Code | Yes | A public re-implementation of DoReMi and optimized domain weights for The Pile can be found at https://github.com/sangmichaelxie/doremi.
Open Datasets | Yes | The Pile [17], a large publicly available dataset, is composed of 24% web data, 9% Wikipedia, 4% GitHub, etc. The GLaM dataset [13] (also used in training PaLM [10]) includes text from 8 domains (Table 2).
Dataset Splits | No | We use held-out validation data to measure the perplexity on each domain.
Hardware Specification | Yes | Models under 1B parameters were trained with TPUv3 accelerators, while 1B and 8B models were trained with TPUv4.
Software Dependencies | No | Finally, we update the proxy model for the objective L(θ_{t-1}, α_t) using a standard optimizer such as Adam [26] or Adafactor [46]. All experiments in this paper use Adafactor. We train Transformer [51] decoder-only LMs with the standard next-token language modeling loss.
Experiment Setup | Yes | All models use a batch size of 512 and a maximum token length of 1024. The proxy and reference models have 280M parameters. For all training runs (including DRO runs), we train with a batch size of 512, an initial learning rate of 1e-3, weight decay of 1e-2, and gradient clipping to norm 1. We decay the learning rate exponentially until it reaches a minimum of 1e-4 at the end of training, with a linear warmup over 6% of the total training steps. We train for 200k steps on The Pile and 300k steps on the GLaM dataset. The learning-rate schedule is sketched after the table.
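
The Algorithm 1 row above refers to DoReMi's Step 2, in which a small proxy model is trained with a Group DRO-style objective and the domain weights are updated from the proxy's excess loss over a reference model. The sketch below shows one such domain-weight update, assuming per-domain average losses are already computed; the function name `update_domain_weights` and the values of the step size `eta` and smoothing parameter `c` are illustrative, not the paper's exact settings.

```python
import numpy as np

def update_domain_weights(alpha, proxy_losses, ref_losses, eta=1.0, c=1e-3):
    """One DoReMi-style domain-weight update (illustrative sketch).

    alpha:        current domain weights, shape (k,), sums to 1
    proxy_losses: per-domain average loss of the proxy model, shape (k,)
    ref_losses:   per-domain average loss of the reference model, shape (k,)
    eta:          exponentiated-gradient step size (assumed value)
    c:            smoothing toward the uniform distribution (assumed value)
    """
    # Excess loss: how much worse the proxy is than the reference on each domain,
    # clipped at zero so domains the proxy already fits well are not upweighted.
    excess = np.maximum(proxy_losses - ref_losses, 0.0)

    # Exponentiated-gradient ascent on the domain weights (in log space for stability).
    log_alpha = np.log(alpha) + eta * excess
    alpha = np.exp(log_alpha - log_alpha.max())
    alpha /= alpha.sum()

    # Mix with the uniform distribution so no domain's weight collapses to zero.
    k = len(alpha)
    return (1.0 - c) * alpha + c * np.ones(k) / k


# Example: three domains, with the proxy lagging the reference on domain 0.
alpha = np.ones(3) / 3
alpha = update_domain_weights(alpha,
                              np.array([3.2, 2.1, 2.5]),   # proxy losses
                              np.array([2.8, 2.1, 2.6]))   # reference losses
print(alpha)  # domain 0 receives more weight than the others
```

In the paper's procedure, the proxy model's training loss at each step is then reweighted by the updated domain weights, and the final DoReMi domain weights are the per-step weights averaged over training.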
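
The learning-rate schedule described in the Experiment Setup row (linear warmup over 6% of training, then exponential decay from 1e-3 down to 1e-4 at the final step) can be written as a short helper. This is a sketch built only from the numbers quoted above, not the authors' training code; the function name `learning_rate` is ours.

```python
def learning_rate(step, total_steps, peak_lr=1e-3, min_lr=1e-4, warmup_frac=0.06):
    """Linear warmup followed by exponential decay to min_lr at total_steps (sketch)."""
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        # Linear warmup from 0 up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    # Exponential decay chosen so the rate hits min_lr exactly at the last step.
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return peak_lr * (min_lr / peak_lr) ** progress


# Example for the 200k-step runs on The Pile.
print(learning_rate(0, 200_000))        # ~8.3e-8, start of warmup
print(learning_rate(12_000, 200_000))   # 1e-3, end of warmup
print(learning_rate(200_000, 200_000))  # 1e-4, end of training
```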