DOGE: Domain Reweighting with Generalization Estimation

Authors: Simin Fan, Matteo Pagliardini, Martin Jaggi

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we extensively show how DOGE improves the generalization of the base model to any target data mixture. On the SlimPajama dataset, our base model achieves better perplexity and few-shot reasoning accuracies across 6 tasks compared to baseline methods. Moreover, when aiming to generalize to out-of-domain target tasks unseen in the pretraining corpus (OOD domains), DOGE effectively identifies inter-domain dependencies and consistently achieves better test perplexity on the target domain.
Researcher Affiliation | Academia | Simin Fan, Matteo Pagliardini, Martin Jaggi (EPFL, Switzerland). Correspondence to: Simin Fan <simin.fan@epfl.ch>.
Pseudocode | Yes | The final algorithm is summarized in Alg. 1; the detailed derivation is presented in Appendix B. Algorithm 1: DOGE Domain Reweighting (for Universal Generalization). Algorithm 2: DOGE Domain Reweighting (for Out-of-domain Generalization). (An illustrative sketch of the reweighting step follows the table.)
Open Source Code | Yes | We provide the codebase at https://github.com/Olivia-fsm/doge.
Open Datasets | Yes | We experiment on SlimPajama (Soboleva et al., 2023), a deduplicated version of RedPajama consisting of data from 7 domains. (A hypothetical loading sketch follows the table.)
Dataset Splits | No | The paper mentions 'held-out validation sets' but does not provide the specific split percentages or counts needed to fully reproduce the validation splits.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory used for the experiments. It mentions training various model sizes but not the underlying hardware.
Software Dependencies | No | The paper mentions the vocabulary size of the tokenizer but does not specify software dependencies with version numbers (e.g., Python version, specific deep learning frameworks like PyTorch or TensorFlow versions).
Experiment Setup | Yes | Auxiliary models for both DOGE and DoReMi are trained for 10k iterations. The final domain weights are used to train larger base models (124M, 210M, 684M). All models are trained from scratch with a batch size of 128 and a sequence length of 512. The vocabulary size of the tokenizer is 50304. Details on model architectures are provided in App. A. The maximal (minimal) learning rate applied to train the largest model (684M) is 1.5e-4 (5e-5), while the others use 5e-4 (1e-4), with a cosine scheduler. The weight decay for all models is set to 0.01 and the gradient clip is set to 1.0. (A hypothetical sketch of this setup follows the table.)
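
The Pseudocode row points to Alg. 1/2 and Appendix B rather than reproducing the update rule. Below is a minimal, hypothetical PyTorch sketch of a gradient-alignment domain-reweighting step in the spirit of Alg. 1: domain weights are updated with an exponentiated-gradient rule driven by the alignment between each domain's gradient and the aggregated target-domain gradient, and the proxy model is then updated with the reweighted gradient mixture. The function name `doge_step`, the stability rescaling, and the fixed step sizes `mu` and `lr` are illustrative assumptions, not the authors' implementation; consult the paper and its codebase for the actual algorithm.

```python
# Hypothetical sketch of a DOGE-style domain reweighting step (not the authors' code).
# Assumes a small proxy model, one minibatch per training domain, and target domains
# given as indices into the list of training domains.
import torch

def flat_grad(loss, params):
    """Flattened gradient of `loss` w.r.t. `params`."""
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def doge_step(model, domain_batches, target_ids, weights, loss_fn, mu=0.1, lr=5e-4):
    """One proxy-model update with exponentiated-gradient domain reweighting.

    domain_batches: list of (inputs, labels), one batch per training domain.
    target_ids:     indices of the target (generalization) domains.
    weights:        current domain weights, a 1-D tensor summing to 1.
    """
    params = [p for p in model.parameters() if p.requires_grad]

    # Per-domain gradients on the proxy model.
    domain_grads = []
    for x, y in domain_batches:
        loss = loss_fn(model(x), y)
        domain_grads.append(flat_grad(loss, params))
    G = torch.stack(domain_grads)                       # (num_domains, num_params)

    # Generalization estimation: alignment of each domain's gradient with the
    # aggregated gradient of the target domains.
    target_grad = G[target_ids].sum(dim=0)
    scores = G @ target_grad                            # (num_domains,)

    # Exponentiated-gradient update of the domain weights.
    # The rescaling by the max score is an illustrative stabilizer, not from the paper.
    new_w = weights * torch.exp(mu * scores / scores.abs().max().clamp_min(1e-8))
    new_w = new_w / new_w.sum()

    # Proxy-model update with the reweighted gradient mixture.
    mixed = (new_w.unsqueeze(1) * G).sum(dim=0)
    with torch.no_grad():
        offset = 0
        for p in params:
            n = p.numel()
            p -= lr * mixed[offset:offset + n].view_as(p)
            offset += n
    return new_w
```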
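The Open Datasets row cites SlimPajama. One possible way to stream it and group documents by source domain is sketched below, assuming the `cerebras/SlimPajama-627B` copy on the Hugging Face Hub and its `meta["redpajama_set_name"]` field; both are assumptions about the hosted release, not details stated in the excerpt.

```python
# Hypothetical SlimPajama loader, grouping a small sample by its source domains.
from collections import defaultdict
from datasets import load_dataset

stream = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

per_domain = defaultdict(list)
for example in stream.take(1000):           # small sample for illustration
    domain = example["meta"]["redpajama_set_name"]
    per_domain[domain].append(example["text"])

for domain, texts in per_domain.items():
    print(f"{domain}: {len(texts)} documents")
```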
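The Experiment Setup row lists the shared hyperparameters (peak/minimum learning rate, cosine schedule, weight decay 0.01, gradient clip 1.0). A minimal PyTorch sketch of how such a setup might be wired is shown below; the optimizer choice (AdamW), the absence of warmup, and the batch field names are assumptions, since the excerpt does not specify them.

```python
# Hypothetical training setup mirroring the reported hyperparameters for the
# smaller models (peak LR 5e-4, minimum LR 1e-4, weight decay 0.01, clip 1.0).
import math
import torch

def build_optimizer_and_scheduler(model, max_steps, peak_lr=5e-4, min_lr=1e-4,
                                  weight_decay=0.01):
    # AdamW is an assumption; the paper excerpt does not name the optimizer.
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                                  weight_decay=weight_decay)

    # Cosine decay from peak_lr down to min_lr over max_steps.
    def lr_lambda(step):
        progress = min(step / max_steps, 1.0)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        return (min_lr + (peak_lr - min_lr) * cosine) / peak_lr

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

def train_step(model, batch, loss_fn, optimizer, scheduler, clip=1.0):
    # `batch` is assumed to carry "input_ids" and "labels"; adapt to the real data pipeline.
    optimizer.zero_grad(set_to_none=True)
    loss = loss_fn(model(batch["input_ids"]), batch["labels"])
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)   # gradient clip of 1.0
    optimizer.step()
    scheduler.step()
    return loss.item()
```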