Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
MixMin: Finding Data Mixtures via Convex Minimization
Authors: Anvith Thudi, Evianne Rovers, Yangjun Ruan, Tristan Thrush, Chris J. Maddison
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We develop and study a gradient-based approach for optimizing this convex objective, which we call MixMin, and test it on language modeling and chemistry tasks. MixMin was the only method that uniformly improved the data mixture in all our experiments. With MixMin, we improved the data mixture using less than 0.2% additional compute for a pythia-410M model trained on 8.2B tokens, resulting in a 1-5% relative improvement in negative log-likelihood on PIQA, ARC Easy, SciQ, and OpenWebMath. |
| Researcher Affiliation | Academia | 1Department of Computer Science, University of Toronto, Toronto, Canada 2Vector Institute, Toronto, Canada 3Department of Chemistry, University of Toronto, Toronto, Canada 4Structural Genomics Consortium, Toronto, Canada 5Department of Computer Science, Stanford University, Palo Alto, USA. Correspondence to: Anvith Thudi <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 MixMin. Require: step size η, number of steps n, loss function L (either cross-entropy or squared ℓ2), samples D_t from the target distribution d_t, and (cheap) models f̂_p(x) trained on each source d_p ∈ P. Initialize: λ_p = 1/\|P\| for all d_p ∈ P, and pre-compute {f̂_p(x) : x ∈ D_t, d_p ∈ P}. 1: for i = 1, ..., n do 2: f̂_λ(x) = Σ_{d_p ∈ P} λ_p f̂_p(x) 3: l(λ) = (1/\|D_t\|) Σ_{(x,y) ∈ D_t} L(f̂_λ(x), y) 4: g = ∇_λ l 5: λ_p ← λ_p e^{−η g_p} / Σ_{d_p ∈ P} λ_p e^{−η g_p} for all d_p ∈ P 6: end for. Return {λ_p}_{d_p ∈ P} |
| Open Source Code | No | The paper does not explicitly provide access to source code for the MixMin methodology described. It mentions adapting code for a baseline (RegMix) from a GitHub repository, but not for their own contribution. |
| Open Datasets | Yes | We used the domains in SlimPajama (cer, 2023) as sources for pre-training...SciQ (Welbl et al., 2017), PIQA (Bisk et al., 2020), ARC-Easy (Clark et al., 2018) and the first 10000 documents in OpenWebMath (Paster et al., 2023). ...We worked with the PCBA dataset (Beaini et al., 2023). The dataset can be found at https://polarishub.io/datasets/graphium/pcba-1328-1564k-v1. |
| Dataset Splits | Yes | We split the target task into a random 80% training set and 20% test set... For every assay in PCBA, we used a 64%/16%/20% train-validation-test split: an original 80%/20% train-test split, with the train set further split to hold out 20% as a validation set. |
| Hardware Specification | Yes | Experiments were run using A100 GPUs and AMD EPYC 7643 CPUs. |
| Software Dependencies | No | The paper mentions using specific models like pythia-410M and tools like XGBoost and Pythia tokenizer, but does not specify software versions for these or other dependencies (e.g., Python, PyTorch/TensorFlow versions). |
| Experiment Setup | Yes | We took the largest batch size that fits on a single A100 GPU, which was 64 for the 160M-pythia model and 32 for the 410M-pythia model for a context length of 1024. For the 160M-pythia model we increased the learning rate until training loss got worse... so we chose 5e-3. For the 410M-pythia model we evaluated learning rates 5e-3 and 1e-2 and found 5e-3 was better... We grid search over all combinations of n_estimators in [10, 50, 100] and max_depth in [4, 6, 8]. For the models trained over all the surrogate assays... we fixed n_estimators = 100 and max_depth = 6. |
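The Algorithm 1 pseudocode quoted above can be sketched in NumPy. This is not the authors' released code (none is available, per the Open Source Code row); it is a minimal sketch assuming cross-entropy loss, class-probability predictions, and an exponentiated-gradient (mirror descent) update matching the `λ_p e^{−η g_p}` step:

```python
import numpy as np

def mixmin(preds, labels, eta=0.1, n_steps=100):
    """Sketch of MixMin-style exponentiated gradient descent.

    preds:  array of shape (P, N, C) -- per-source-model class probabilities
            on N target samples (pre-computed, as in Algorithm 1).
    labels: int array of shape (N,) -- target labels y for samples in D_t.
    Returns mixture weights lambda of shape (P,) on the simplex.
    """
    P, N, C = preds.shape
    lam = np.full(P, 1.0 / P)                  # initialize uniformly: 1/|P|
    for _ in range(n_steps):
        mix = np.tensordot(lam, preds, axes=1)  # (N, C) mixture prediction
        p_true = mix[np.arange(N), labels]      # probability of correct label
        # gradient of mean cross-entropy w.r.t. lambda_p:
        # d/dlam_p [-(1/N) sum_i log(sum_q lam_q f_q)] = -(1/N) sum_i f_p / mix
        grad = -(preds[:, np.arange(N), labels] / p_true).mean(axis=1)
        lam = lam * np.exp(-eta * grad)         # multiplicative update
        lam /= lam.sum()                        # renormalize onto the simplex
    return lam
```

The multiplicative update keeps the weights nonnegative and the renormalization keeps them summing to one, so every iterate is a valid data mixture.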
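The 64%/16%/20% split in the Dataset Splits row follows from two nested 80/20 splits (20% of the remaining 80% is 16% overall). A minimal sketch of that arithmetic (the function name and seeding are illustrative, not from the paper):

```python
import numpy as np

def train_val_test_split(n, seed=0):
    """Sketch of a 64/16/20 split: an 80/20 train-test split, then 20% of
    the train set held out as validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_test = int(0.2 * n)                 # 20% test
    test, train_full = idx[:n_test], idx[n_test:]
    n_val = int(0.2 * len(train_full))    # 20% of the 80% train = 16% overall
    val, train = train_full[:n_val], train_full[n_val:]
    return train, val, test
```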