Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Data Mixture Optimization: A Multi-fidelity Multi-scale Bayesian Framework

Authors: Thomson Yen, Andrew Siah, Haozhe Chen, C. Guetta, Tianyi Peng, Hongseok Namkoong

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To accelerate methodological progress, we build a simulator based on 472 language model pre-training runs with varying data compositions from the Slim Pajama dataset. We observe that even simple kernels and acquisition functions can enable principled decisions across training models from 20M to 1B parameters and achieve 2.6x and 3.3x speedups compared to multi-fidelity Bayesian optimization and random search baselines. [...] For each of the algorithms evaluated, we run 20 experiments over different seeds, and show the one standard deviation bound with shaded regions.
Researcher Affiliation	Academia	Thomson Yen Decision, Risk, and Operations Division Columbia Business School EMAIL
Pseudocode	Yes	Algorithm 1 Gaussian Process and EIpu [...] Algorithm 2 Hyperband with Random Forest, EI
Open Source Code	Yes	Our code is available at https://github.com/namkoong-lab/data-recipes.
Open Datasets	Yes	To generate the training data for our predictors, we pretrained 472 language models using the OLMo 2 package (OLMo et al., 2024) with datasets derived from Slim Pajama (Shen et al., 2024), a deduplicated subset of Red Pajama (Weber et al., 2024).
Dataset Splits	Yes	We trained the predictor on 472 language model training runs described in Section 2.1, where 422 of the runs are randomly selected as a training set and the remaining 50 runs as a validation set.
Hardware Specification	Yes	The entire dataset was collected using 4x NVIDIA H100 80GB HBM3 for 500 compute days.
Software Dependencies	Yes	We use the OLMo 2 (OLMo et al., 2024) package for training our language models. [...] We employ Hyperband, implemented via SMAC (Lindauer et al., 2022). [...] The GP hyperparameters are trained using the Adam optimizer (Kingma and Ba, 2014) with a 0.1 learning rate for 50 iterations.
Experiment Setup	Yes	Training of our predictor is conducted for 20 epochs using a batch size of 64, an Adam optimizer with a learning rate of 0.001 and weight decay of 0.01, and data normalization (standard scaling) applied to both inputs and outputs. [...] For multi-fidelity multi-scale GP, we limit the space of training steps to be Z = {6000, 12000, 19700}. [...] The GP hyperparameters are trained using the Adam optimizer (Kingma and Ba, 2014) with a 0.1 learning rate for 50 iterations.
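
The Dataset Splits and Experiment Setup rows describe a random 422/50 split of the 472 runs and standard scaling of inputs and outputs. The sketch below illustrates one way such a split and scaling step could look; it is not taken from the authors' repository. The function names, the seed, and the placeholder loss values are hypothetical, and fitting the scaler on the training set only is an assumption that the quoted text does not confirm.

```python
import random
import statistics


def split_runs(run_ids, n_val=50, seed=0):
    """Randomly split run identifiers into (train, validation) lists.

    n_val=50 matches the validation size quoted above; the seed is an
    arbitrary choice made for this illustration.
    """
    rng = random.Random(seed)
    shuffled = list(run_ids)
    rng.shuffle(shuffled)
    return shuffled[n_val:], shuffled[:n_val]


def fit_standard_scaler(values):
    """Return (mean, std) for standard scaling; guard against zero variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean, (std if std > 0 else 1.0)


def standardize(values, mean, std):
    """Apply (v - mean) / std to every value."""
    return [(v - mean) / std for v in values]


# 472 runs, as stated in the Dataset Splits row.
run_ids = list(range(472))
train_ids, val_ids = split_runs(run_ids)

# Placeholder targets standing in for per-run losses (NOT real data).
placeholder_rng = random.Random(1)
losses = {i: placeholder_rng.uniform(2.5, 4.0) for i in run_ids}

# Assumption: scaling statistics come from the training set only and are
# then reused unchanged on the validation set.
mean, std = fit_standard_scaler([losses[i] for i in train_ids])
train_scaled = standardize([losses[i] for i in train_ids], mean, std)
val_scaled = standardize([losses[i] for i in val_ids], mean, std)

print(len(train_ids), len(val_ids))
```

Reusing the training-set statistics on the validation set keeps validation data out of the fitted scaler, which is a common convention but only one possible reading of the setup described in the table.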