Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
MLE-STAR: Machine Learning Engineering Agent via Search and Targeted Refinement
Authors: Jaehyun Nam, Jinsung Yoon, Jiefeng Chen, Jinwoo Shin, Sercan Arik, Tomas Pfister
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To verify the effectiveness, we conduct comprehensive evaluations of MLE-STAR using the MLE-bench's Kaggle competitions [16]. The experimental results demonstrate that MLE-STAR, requiring only minimal human effort (e.g., defining initial prompts that are generalizable to any tasks), significantly outperforms previous methods [12], including those requiring manual labor to collect strategies from Kaggle [10]. In particular, MLE-STAR achieves a substantial gain in medal achievement, improving it from 36.6% to 63.6% when compared to the top-performing baseline. |
| Researcher Affiliation | Collaboration | Jaehyun Nam1,2, Jinsung Yoon1, Jiefeng Chen1, Jinwoo Shin2, Sercan Ö. Arık1, Tomas Pfister1; 1Google Cloud, 2KAIST; EMAIL, EMAIL |
| Pseudocode | Yes | The prompts and algorithms used in each step can be found in Appendix A and B, respectively. |
| Open Source Code | Yes | We release open-source codebase of MLE-STAR at https://github.com/google/adk-samples. |
| Open Datasets | Yes | In this section, we validate the effectiveness of MLE-STAR using 22 Kaggle competitions from MLE-bench Lite [16]. |
| Dataset Splits | Yes | We evaluate the performance of each s using a task-specific metric h on dataset D. We denote the resulting score by h(s), which encapsulates the entire process done in s: splitting D into training and validation sets, training the model specified in s using the training data, and calculating h on the validation data. Following the MLE-bench's setup, we set a maximum time limit of 24 hours for a fair comparison (see computation analysis in Appendix F). |
| Hardware Specification | Yes | Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: We provide compute resources we used in Appendix F. |
| Software Dependencies | No | The paper mentions "scikit-learn library [15]" but does not specify a version number or other software dependencies with their versions. |
| Experiment Setup | Yes | All experiments are conducted on 22 Kaggle competitions from MLE-bench Lite [16] using three random seeds, unless otherwise specified. Here, we use an agent A_test, which takes the task description and the final solution as input, and outputs the code that incorporates loading test sample and creating a submission file (see Appendix E for details). MLE-STAR begins by retrieving four model candidates. MLE-STAR refines for four inner loops, while exploring four outer loops. For ensemble, MLE-STAR generates two solutions in parallel, and explore ensemble strategies for five rounds. Following the MLE-bench's setup, we set a maximum time limit of 24 hours for a fair comparison (see computation analysis in Appendix F). |
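
The Dataset Splits and Experiment Setup rows describe a score h(s) and a handful of loop counts. The following is a minimal Python sketch of how those quoted details might fit together; it is not the MLE-STAR codebase. All names (`SetupSketch`, `split`, `score`, `mean_predictor`, `mean_absolute_error`), the 20% validation fraction, and the toy model and metric are assumptions made for illustration. Only the numeric defaults in `SetupSketch` come from the excerpts above.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, float]          # (feature, target); a toy stand-in for D
Model = Callable[[float], float]       # maps a feature to a prediction


@dataclass(frozen=True)
class SetupSketch:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    num_competitions: int = 22
    num_seeds: int = 3
    model_candidates: int = 4
    inner_loops: int = 4
    outer_loops: int = 4
    parallel_solutions: int = 2
    ensemble_rounds: int = 5
    time_limit_hours: int = 24


def split(data: Sequence[Example], val_fraction: float,
          seed: int) -> Tuple[List[Example], List[Example]]:
    """Shuffle with a fixed seed and split D into training and validation sets."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def mean_predictor(train: Sequence[Example]) -> Model:
    """Toy stand-in for 'the model specified in s': predicts the training mean."""
    mean = sum(y for _, y in train) / len(train)
    return lambda x: mean


def mean_absolute_error(model: Model, val: Sequence[Example]) -> float:
    """Toy stand-in for the task-specific metric h."""
    return sum(abs(model(x) - y) for x, y in val) / len(val)


def score(solution: Callable[[Sequence[Example]], Model],
          metric: Callable[[Model, Sequence[Example]], float],
          data: Sequence[Example],
          val_fraction: float = 0.2,   # assumed; the excerpt gives no ratio
          seed: int = 0) -> float:
    """h(s): split D, train on the training part, evaluate h on validation."""
    train, val = split(data, val_fraction, seed)
    model = solution(train)
    return metric(model, val)


if __name__ == "__main__":
    toy_data = [(float(i), 2.0) for i in range(10)]
    cfg = SetupSketch()
    for seed in range(cfg.num_seeds):
        print(seed, score(mean_predictor, mean_absolute_error, toy_data, seed=seed))
```

In this sketch the solution is passed as a callable so that the split, training, and metric steps stay inside `score`, mirroring the statement that h(s) "encapsulates the entire process done in s". The excerpts do not specify how the paper implements the split or which metric each competition uses.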