Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

MLE-STAR: Machine Learning Engineering Agent via Search and Targeted Refinement

Authors: Jaehyun Nam, Jinsung Yoon, Jiefeng Chen, Jinwoo Shin, Sercan Arik, Tomas Pfister

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To verify the effectiveness, we conduct comprehensive evaluations of MLE-STAR using the MLE-bench's Kaggle competitions [16]. The experimental results demonstrate that MLE-STAR, requiring only minimal human effort (e.g., defining initial prompts that are generalizable to any tasks), significantly outperforms previous methods [12], including those requiring manual labor to collect strategies from Kaggle [10]. In particular, MLE-STAR achieves a substantial gain in medal achievement, improving it from 36.6% to 63.6% when compared to the top-performing baseline.
Researcher Affiliation	Collaboration	Jaehyun Nam (1, 2), Jinsung Yoon (1), Jiefeng Chen (1), Jinwoo Shin (2), Sercan Ö. Arık (1), Tomas Pfister (1). Affiliations: (1) Google Cloud, (2) KAIST.
Pseudocode	Yes	The prompts and algorithms used in each step can be found in Appendix A and B, respectively.
Open Source Code	Yes	We release open-source codebase of MLE-STAR at https://github.com/google/adk-samples.
Open Datasets	Yes	In this section, we validate the effectiveness of MLE-STAR using 22 Kaggle competitions from MLE-bench Lite [16].
Dataset Splits	Yes	We evaluate the performance of each s using a task-specific metric h on dataset D. We denote the resulting score by h(s), which encapsulates the entire process done in s: splitting D into training and validation sets, training the model specified in s using the training data, and calculating h on the validation data. Following the MLE-bench's setup, we set a maximum time limit of 24 hours for a fair comparison (see computation analysis in Appendix F).
Hardware Specification	Yes	Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: We provide compute resources we used in Appendix F.
Software Dependencies	No	The paper mentions "scikit-learn library [15]" but does not specify a version number or other software dependencies with their versions.
Experiment Setup	Yes	All experiments are conducted on 22 Kaggle competitions from MLE-bench Lite [16] using three random seeds, unless otherwise specified. Here, we use an agent A_test, which takes the task description and the final solution as input, and outputs the code that incorporates loading test sample and creating a submission file (see Appendix E for details). MLE-STAR begins by retrieving four model candidates. MLE-STAR refines for four inner loops, while exploring four outer loops. For ensemble, MLE-STAR generates two solutions in parallel, and explore ensemble strategies for five rounds. Following the MLE-bench's setup, we set a maximum time limit of 24 hours for a fair comparison (see computation analysis in Appendix F).
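
The Dataset Splits and Experiment Setup rows quote a scoring procedure h(s) and a handful of loop counts. The Python sketch below is only an illustration of how those quoted values fit together; it is not taken from the MLE-STAR codebase. All class, field, and function names are hypothetical, the 20% validation fraction is an assumption (the quoted text gives no split ratio), and the majority-class "model" is a toy stand-in for whatever a real solution s would train.

```python
import random
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Row = Tuple[float, int]  # (feature, label): toy stand-in for one example in D


@dataclass(frozen=True)
class ExperimentSetup:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    num_competitions: int = 22       # Kaggle competitions from MLE-bench Lite
    num_seeds: int = 3               # random seeds per competition
    num_model_candidates: int = 4    # models retrieved at the start
    num_outer_loops: int = 4
    num_inner_loops: int = 4
    num_parallel_solutions: int = 2  # solutions generated for ensembling
    num_ensemble_rounds: int = 5
    time_limit_hours: int = 24

    @property
    def total_runs(self) -> int:
        # One run per (competition, seed) pair.
        return self.num_competitions * self.num_seeds

    @property
    def max_refinement_steps(self) -> int:
        # Assumption: every outer loop runs all of its inner loops.
        return self.num_outer_loops * self.num_inner_loops


def majority_class_trainer(train: Sequence[Row]) -> Callable[[float], int]:
    """Toy 'model': always predicts the most frequent training label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda _x: majority


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Toy task-specific metric h."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def score(
    train_fn: Callable[[Sequence[Row]], Callable[[float], int]],
    metric: Callable[[Sequence[int], Sequence[int]], float],
    dataset: Sequence[Row],
    val_fraction: float = 0.2,  # assumed; not stated in the quoted text
    seed: int = 0,
) -> float:
    """h(s): split D, train on the training part, compute h on the validation part."""
    rows = list(dataset)
    random.Random(seed).shuffle(rows)
    n_val = max(1, int(len(rows) * val_fraction))
    val, train = rows[:n_val], rows[n_val:]
    predict = train_fn(train)
    return metric([y for _, y in val], [predict(x) for x, _ in val])


if __name__ == "__main__":
    setup = ExperimentSetup()
    print(setup.total_runs, setup.max_refinement_steps)
    toy = [(float(i), 1) for i in range(10)]
    print(score(majority_class_trainer, accuracy, toy))
```

With the quoted defaults, `total_runs` works out to 66 (22 competitions times three seeds), and `max_refinement_steps` to 16 under the stated assumption about how the inner and outer loops nest.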