Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
MixMin: Finding Data Mixtures via Convex Minimization
Authors: Anvith Thudi, Evianne Rovers, Yangjun Ruan, Tristan Thrush, Chris J. Maddison
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We develop and study a gradient-based approach for optimizing this convex objective, which we call MixMin, and test it on language modeling and chemistry tasks. MixMin was the only method that uniformly improved the data mixture in all our experiments. With MixMin, we improved the data mixture using less than 0.2% additional compute for a pythia-410M model trained on 8.2B tokens, resulting in a 1-5% relative improvement in negative log-likelihood on PIQA, ARC Easy, SciQ, and OpenWebMath. |
| Researcher Affiliation | Academia | 1Department of Computer Science, University of Toronto, Toronto, Canada 2Vector Institute, Toronto, Canada 3Department of Chemistry, University of Toronto, Toronto, Canada 4Structural Genomics Consortium, Toronto, Canada 5Department of Computer Science, Stanford University, Palo Alto, USA. Correspondence to: Anvith Thudi <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 MixMin. Require: step size η, number of steps n, loss function L (either cross-entropy or squared ℓ2), samples D_t from the target distribution d_t, and (cheap) models f̂_p(x) trained on each source d_p ∈ P. Initialize: λ_p = 1/\|P\| for all d_p ∈ P, and pre-compute {f̂_p(x) : x ∈ D_t, d_p ∈ P}. 1: for i = 1, ..., n do 2: f̂_λ(x) = Σ_{d_p ∈ P} λ_p f̂_p(x) 3: l(λ) = (1/\|D_t\|) Σ_{(x,y) ∈ D_t} L(f̂_λ(x), y) 4: g = ∇_λ l 5: λ_p ← λ_p e^{−η g_p} / Σ_{d_p ∈ P} λ_p e^{−η g_p} for all d_p ∈ P 6: end for. Return {λ_p}_{d_p ∈ P} |
| Open Source Code | No | The paper does not explicitly provide access to source code for the MixMin methodology described. It mentions adapting code for a baseline (RegMix) from a GitHub repository, but not for their own contribution. |
| Open Datasets | Yes | We used the domains in SlimPajama (cer, 2023) as sources for pre-training...SciQ (Welbl et al., 2017), PIQA (Bisk et al., 2020), ARC-Easy (Clark et al., 2018) and the first 10000 documents in OpenWebMath (Paster et al., 2023). ...We worked with the PCBA dataset (Beaini et al., 2023). The dataset can be found at https://polarishub.io/datasets/graphium/pcba-1328-1564k-v1. |
| Dataset Splits | Yes | We split the target task into a random 80% training set and 20% test set... For every assay in PCBA, we used a 64%/16%/20% train-validation-test split: an original 80%/20% train-test split, with the train set further split to hold out 20% as a validation set. |
| Hardware Specification | Yes | Experiments were run using A100 GPUs and AMD EPYC 7643 CPUs. |
| Software Dependencies | No | The paper mentions using specific models like pythia-410M and tools like XGBoost and Pythia tokenizer, but does not specify software versions for these or other dependencies (e.g., Python, PyTorch/TensorFlow versions). |
| Experiment Setup | Yes | We took the largest batch size that fits on a single A100 GPU, which was 64 for the 160M-pythia model and 32 for the 410M-pythia model for a context length of 1024. For the 160M-pythia model we increased the learning rate until training loss got worse... so we chose 5e-3. For the 410M-pythia model we evaluated learning rates 5e-3 and 1e-2 and found 5e-3 was better... We grid search over all combinations of n_estimators in [10, 50, 100] and max_depth in [4, 6, 8]. For the models trained over all the surrogate assays... we fixed n_estimators = 100 and max_depth = 6. |
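The Algorithm 1 pseudocode quoted above can be sketched in NumPy. This is not the authors' released code (none is available, per the Open Source Code row); it is a minimal sketch assuming cross-entropy loss, class-probability predictions, and an exponentiated-gradient (mirror descent) update matching the `λ_p e^{−η g_p}` step:

```python
import numpy as np

def mixmin(preds, labels, eta=0.1, n_steps=100):
    """Sketch of MixMin-style exponentiated gradient descent.

    preds:  array of shape (P, N, C) -- per-source-model class probabilities
            on N target samples (pre-computed, as in Algorithm 1).
    labels: int array of shape (N,) -- target labels y for samples in D_t.
    Returns mixture weights lambda of shape (P,) on the simplex.
    """
    P, N, C = preds.shape
    lam = np.full(P, 1.0 / P)                  # initialize uniformly: 1/|P|
    for _ in range(n_steps):
        mix = np.tensordot(lam, preds, axes=1)  # (N, C) mixture prediction
        p_true = mix[np.arange(N), labels]      # probability of correct label
        # gradient of mean cross-entropy w.r.t. lambda_p:
        # d/dlam_p [-(1/N) sum_i log(sum_q lam_q f_q)] = -(1/N) sum_i f_p / mix
        grad = -(preds[:, np.arange(N), labels] / p_true).mean(axis=1)
        lam = lam * np.exp(-eta * grad)         # multiplicative update
        lam /= lam.sum()                        # renormalize onto the simplex
    return lam
```

The multiplicative update keeps the weights nonnegative and the renormalization keeps them summing to one, so every iterate is a valid data mixture.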
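The 64%/16%/20% split in the Dataset Splits row follows from two nested 80/20 splits (20% of the remaining 80% is 16% overall). A minimal sketch of that arithmetic (the function name and seeding are illustrative, not from the paper):

```python
import numpy as np

def train_val_test_split(n, seed=0):
    """Sketch of a 64/16/20 split: an 80/20 train-test split, then 20% of
    the train set held out as validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_test = int(0.2 * n)                 # 20% test
    test, train_full = idx[:n_test], idx[n_test:]
    n_val = int(0.2 * len(train_full))    # 20% of the 80% train = 16% overall
    val, train = train_full[:n_val], train_full[n_val:]
    return train, val, test
```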