DOGE: Domain Reweighting with Generalization Estimation

Authors: Simin Fan, Matteo Pagliardini, Martin Jaggi

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we extensively show how DOGE improves the generalization of the base model to any target data mixture. On the SlimPajama dataset, our base model achieves better perplexity and few-shot reasoning accuracies across 6 tasks compared to baseline methods. Moreover, when aiming to generalize to out-of-domain target tasks unseen in the pretraining corpus (OOD domains), DOGE effectively identifies inter-domain dependencies and consistently achieves better test perplexity on the target domain.
Researcher Affiliation | Academia | Simin Fan, Matteo Pagliardini, Martin Jaggi (EPFL, Switzerland). Correspondence to: Simin Fan <simin.fan@epfl.ch>.
Pseudocode | Yes | The final algorithm is summarized in Alg. 1; the detailed derivation is presented in Appendix B. Algorithm 1: DOGE Domain Reweighting (for Universal Generalization). Algorithm 2: DOGE Domain Reweighting (for Out-of-domain Generalization). (An illustrative sketch of the reweighting step follows the table.)
Open Source Code | Yes | We provide the codebase at https://github.com/Olivia-fsm/doge.
Open Datasets | Yes | We experiment on SlimPajama (Soboleva et al., 2023), a deduplicated version of RedPajama consisting of data from 7 domains. (A hypothetical loading sketch follows the table.)
Dataset Splits | No | The paper mentions 'held-out validation sets' but does not provide the specific split percentages or counts needed to fully reproduce the validation splits.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory used for the experiments. It mentions training various model sizes but not the underlying hardware.
Software Dependencies | No | The paper mentions the vocabulary size of the tokenizer but does not specify software dependencies with version numbers (e.g., Python version, specific deep learning frameworks like PyTorch or TensorFlow versions).
Experiment Setup | Yes | Auxiliary models for both DOGE and DoReMi are trained for 10k iterations. The final domain weights are used to train larger base models (124M, 210M, 684M). All models are trained from scratch with a batch size of 128 and a sequence length of 512. The vocabulary size of the tokenizer is 50304. Details on model architectures are provided in App. A. The maximal (minimal) learning rate applied to train the largest model (684M) is 1.5e-4 (5e-5), while the others use 5e-4 (1e-4), with a cosine scheduler. The weight decay for all models is set to 0.01 and the gradient clip is set to 1.0. (A hypothetical sketch of this setup follows the table.)
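
The Pseudocode row points to Alg. 1/2 and Appendix B rather than reproducing the update rule. Below is a minimal, hypothetical PyTorch sketch of a gradient-alignment domain-reweighting step in the spirit of Alg. 1: domain weights are updated with an exponentiated-gradient rule driven by the alignment between each domain's gradient and the aggregated target-domain gradient, and the proxy model is then updated with the reweighted gradient mixture. The function name `doge_step`, the stability rescaling, and the fixed step sizes `mu` and `lr` are illustrative assumptions, not the authors' implementation; consult the paper and its codebase for the actual algorithm.

```python
# Hypothetical sketch of a DOGE-style domain reweighting step (not the authors' code).
# Assumes a small proxy model, one minibatch per training domain, and target domains
# given as indices into the list of training domains.
import torch

def flat_grad(loss, params):
    """Flattened gradient of `loss` w.r.t. `params`."""
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def doge_step(model, domain_batches, target_ids, weights, loss_fn, mu=0.1, lr=5e-4):
    """One proxy-model update with exponentiated-gradient domain reweighting.

    domain_batches: list of (inputs, labels), one batch per training domain.
    target_ids:     indices of the target (generalization) domains.
    weights:        current domain weights, a 1-D tensor summing to 1.
    """
    params = [p for p in model.parameters() if p.requires_grad]

    # Per-domain gradients on the proxy model.
    domain_grads = []
    for x, y in domain_batches:
        loss = loss_fn(model(x), y)
        domain_grads.append(flat_grad(loss, params))
    G = torch.stack(domain_grads)                       # (num_domains, num_params)

    # Generalization estimation: alignment of each domain's gradient with the
    # aggregated gradient of the target domains.
    target_grad = G[target_ids].sum(dim=0)
    scores = G @ target_grad                            # (num_domains,)

    # Exponentiated-gradient update of the domain weights.
    # The rescaling by the max score is an illustrative stabilizer, not from the paper.
    new_w = weights * torch.exp(mu * scores / scores.abs().max().clamp_min(1e-8))
    new_w = new_w / new_w.sum()

    # Proxy-model update with the reweighted gradient mixture.
    mixed = (new_w.unsqueeze(1) * G).sum(dim=0)
    with torch.no_grad():
        offset = 0
        for p in params:
            n = p.numel()
            p -= lr * mixed[offset:offset + n].view_as(p)
            offset += n
    return new_w
```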
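The Open Datasets row cites SlimPajama. One possible way to stream it and group documents by source domain is sketched below, assuming the `cerebras/SlimPajama-627B` copy on the Hugging Face Hub and its `meta["redpajama_set_name"]` field; both are assumptions about the hosted release, not details stated in the excerpt.

```python
# Hypothetical SlimPajama loader, grouping a small sample by its source domains.
from collections import defaultdict
from datasets import load_dataset

stream = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

per_domain = defaultdict(list)
for example in stream.take(1000):           # small sample for illustration
    domain = example["meta"]["redpajama_set_name"]
    per_domain[domain].append(example["text"])

for domain, texts in per_domain.items():
    print(f"{domain}: {len(texts)} documents")
```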
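The Experiment Setup row lists the shared hyperparameters (peak/minimum learning rate, cosine schedule, weight decay 0.01, gradient clip 1.0). A minimal PyTorch sketch of how such a setup might be wired is shown below; the optimizer choice (AdamW), the absence of warmup, and the batch field names are assumptions, since the excerpt does not specify them.

```python
# Hypothetical training setup mirroring the reported hyperparameters for the
# smaller models (peak LR 5e-4, minimum LR 1e-4, weight decay 0.01, clip 1.0).
import math
import torch

def build_optimizer_and_scheduler(model, max_steps, peak_lr=5e-4, min_lr=1e-4,
                                  weight_decay=0.01):
    # AdamW is an assumption; the paper excerpt does not name the optimizer.
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                                  weight_decay=weight_decay)

    # Cosine decay from peak_lr down to min_lr over max_steps.
    def lr_lambda(step):
        progress = min(step / max_steps, 1.0)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        return (min_lr + (peak_lr - min_lr) * cosine) / peak_lr

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

def train_step(model, batch, loss_fn, optimizer, scheduler, clip=1.0):
    # `batch` is assumed to carry "input_ids" and "labels"; adapt to the real data pipeline.
    optimizer.zero_grad(set_to_none=True)
    loss = loss_fn(model(batch["input_ids"]), batch["labels"])
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)   # gradient clip of 1.0
    optimizer.step()
    scheduler.step()
    return loss.item()
```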