Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML
Authors: Patara Trirat, Wonyong Jeong, Sung Ju Hwang
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on seven downstream tasks using fourteen datasets show that AutoML-Agent achieves a higher success rate in automating the full AutoML process, yielding systems with good performance throughout the diverse domains. ... We demonstrate the superiority of the proposed AutoML-Agent framework through extensive experiments on seven downstream tasks using fourteen datasets. |
| Researcher Affiliation | Collaboration | Patara Trirat 1 Wonyong Jeong 1 Sung Ju Hwang 1 2 1Deep Auto.ai 2KAIST, Seoul, South Korea. Correspondence to: Sung Ju Hwang <EMAIL>. Deep Auto.ai appears to be an industry entity (indicated by email domain deepauto.ai), while KAIST is an academic institution (Korea Advanced Institute of Science and Technology). Since authors are affiliated with both types of institutions, it is classified as a collaboration. |
| Pseudocode | Yes | As depicted in Figure 2 and Algorithm 1. ... Algorithm 1 Overall Procedure of AutoML-Agent |
| Open Source Code | Yes | We have made the source code available at https://github.com/deepauto-ai/automl-agent. |
| Open Datasets | Yes | Extensive experiments on seven downstream tasks using fourteen datasets... These datasets are chosen from different sources. ... Butterfly Image (Butterfly). ... The dataset is accessible at https://www.kaggle.com/datasets/phucthaiv02/butterfly-image-classification. ... Shopee-IET (Shopee). ... The dataset is available at https://www.kaggle.com/competitions/demo-shopee-iet-competition/data. ... Textual Entailment (Entail). ... We use the dataset provided by Guo et al. (2024a). ... Higher Education Students Performance (Student). ... This dataset can be found at https://archive.ics.uci.edu/dataset/856/higher+education+students+performance+evaluation. ... Cora and Citeseer. ... We use the version provided by Fey & Lenssen (2019). |
| Dataset Splits | Yes | Dataset Splitting: Split the dataset into training, validation, and testing sets (e.g., 80% training and 20% validation). ... In the (3) execution stage, the Data (Ad) and Model (Am) Agents decompose these plans and execute them via plan decomposition (PD) and prompting-based plan execution (Figure 2(b) and Lines 13-16)... ... # TODO: Step 2. Create a train-valid-test split of the data by splitting the dataset into train_loader, valid_loader, and test_loader. # Here, the train_loader contains 70% of the dataset, the valid_loader contains 20% of the dataset, and the test_loader contains 10% of the dataset. |
| Hardware Specification | Yes | All experiments are conducted on an Ubuntu 22.04 LTS server equipped with eight NVIDIA A100 GPUs (CUDA 12.4) and Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz. |
| Software Dependencies | Yes | Except for the Ap that is implemented with Mixtral-8x7B (Mixtral-8x7B-Instruct-v0.1) (Jiang et al., 2024), we use GPT-4 (gpt-4o-2024-05-13) as the backbone model for all agents... All experiments are conducted on an Ubuntu 22.04 LTS server equipped with eight NVIDIA A100 GPUs (CUDA 12.4)... |
| Experiment Setup | Yes | For RAP (3.4), we set the number of plans P = 3 and the number of candidate models k = 3. ... For the constraint-free setting, a method can get a score of 0.5 (pass modeling) or 1.0 (pass deployment). For the constraint-aware setting, a method can get a score of 0.25 (pass modeling), 0.5 (pass deployment), 0.75 (partially pass the constraints), or 1.0 (pass all cases). ... We report the average scores from five independent runs for all evaluation metrics in Figure 4. ... optimizer = optim.Adam(model.parameters(), lr=0.00001)... num_epochs = 100 ... early_stop_patience = 10 |
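The 70/20/10 train/valid/test split quoted in the Dataset Splits row can be sketched as a plain index split. This is an illustrative reconstruction, not the paper's generated code: the function name, the seeded shuffle, and the index-based return values are assumptions; only the 70/20/10 ratios come from the quoted comment.

```python
import random

def train_valid_test_split(n_samples, ratios=(0.7, 0.2, 0.1), seed=0):
    """Partition dataset indices into disjoint train/valid/test subsets.

    The default ratios mirror the 70%/20%/10% split described in the
    paper's generated code comment; everything else here is a sketch.
    """
    assert abs(sum(ratios) - 1.0) < 1e-9, "ratios must sum to 1"
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # deterministic shuffle for reproducibility
    n_train = int(n_samples * ratios[0])
    n_valid = int(n_samples * ratios[1])
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]
    return train, valid, test
```

In a PyTorch pipeline like the one quoted, each index list would then back a `Subset` wrapped in a `DataLoader` (`train_loader`, `valid_loader`, `test_loader`).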
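The Experiment Setup row quotes `early_stop_patience = 10` alongside `num_epochs = 100` and Adam with `lr=0.00001`. A minimal sketch of the standard patience-based early-stopping rule those settings imply (the helper function and its signature are assumptions; the source shows only the hyperparameter values):

```python
def should_stop(val_losses, patience=10):
    """Return True when the best validation loss has not improved
    during the last `patience` epochs.

    patience=10 matches the quoted early_stop_patience; the stopping
    rule itself is the conventional one, assumed for illustration.
    """
    if len(val_losses) <= patience:
        return False  # not enough history yet to trigger a stop
    best_before = min(val_losses[:-patience])
    # stop if none of the last `patience` epochs beat the earlier best
    return min(val_losses[-patience:]) >= best_before
```

Inside a training loop over the quoted `num_epochs = 100`, this check would run once per epoch after computing the validation loss, breaking out early when it returns True.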