Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

MLZero: A Multi-Agent System for End-to-end Machine Learning Automation

Authors: Haoyang Fang, Boran Han, Nick Erickson, Xiyuan Zhang, Su Zhou, Anirudh Dagar, Jiani Zhang, Ali Caner Turkmen, Tony Hu, Huzefa Rangwala, Ying Nian Wu, Yuyang (Bernie) Wang, George Karypis

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To evaluate the effectiveness of MLZero against state-of-the-art ML and coding agents, we conducted extensive experiments across multiple benchmarks and datasets. We first assess performance on MLE-bench Lite [10] with 21 diverse Kaggle competitions, then proceed to the Multimodal AutoML Agent Benchmark for end-to-end evaluation across 25 diverse datasets spanning various modalities and ML tasks. We evaluate the performance using multiple metrics including success rate, average rank, relative time consumption, and solution quality. Additionally, we performed ablation studies to quantify the contribution of individual components within our proposed system. Finally, we conducted a detailed error analysis to identify and categorize failure cases across high-performance methods, providing insights into the robustness and limitations of each approach, followed by efficiency and robustness analysis examining token consumption, cost effectiveness, and robustness across different LLM backbones and under various noise conditions.
Researcher Affiliation	Industry	Haoyang Fang, Boran Han, Nick Erickson, Xiyuan Zhang, Su Zhou, Anirudh Dagar, Jiani Zhang, Ali Caner Turkmen, Cuixiong Hu, Huzefa Rangwala, Ying Nian Wu, Bernie Wang, George Karypis. Amazon Web Services.
Pseudocode	Yes	Algorithm 1 File Grouping. 1: procedure FILEGROUPING(files) 2: depthFolders ← Map from depth to set of folders ▷ First pass: analyze folder structure 3: for all file ∈ files do 4: paths ← SplitPath(file.path) 5: for depth ← 0 to |paths| − 2 do 6: depthFolders[depth].add(paths[depth]) 7: end for 8: end for ▷ Second pass: group files 9: groups ← Map from pattern to file list 10: for all file ∈ files do 11: paths ← SplitPath(file.path) 12: pattern ← [] ▷ Build pattern using folder structure 13: for depth ← 0 to |paths| − 2 do 14: if |depthFolders[depth]| ≤ δ then 15: pattern.append(paths[depth]) ▷ Use actual folder name 16: else 17: pattern.append(∗) ▷ Use wildcard 18: end if 19: end for 20: pattern.append(GetExtension(file.name)) 21: groups[pattern].append(file) 22: end for 23: return groups 24: end procedure
Open Source Code	Yes	Website: https://project-mlzero.github.io/ GitHub: https://github.com/autogluon/autogluon-assistant ... Justification: Please check Appendix C for the implementation details, and Appendix B for the design and prompts for each agent. Our source code is also attached in the supplemental material.
Open Datasets	Yes	We constructed the Multimodal AutoML Agent Benchmark (MAAB) to address a critical gap in existing benchmarks: the ability to evaluate agents on raw, unprocessed multimodal data. To ensure fairness and diversity, all 25 datasets are sourced from reputable public repositories including Kaggle competitions, UCI Machine Learning Repository, and the BEIR benchmark suite. ... We open-source the complete benchmark including all datasets, evaluation scripts, and preprocessing specifications.
Dataset Splits	Yes	The RVL-CDIP (Ryerson Vision Lab Complex Document Information Processing) dataset comprises 400,000 grayscale images in 16 classes, with 25,000 images per class. There are 320,000 training images, 40,000 validation images, and 40,000 test images. ... The Airbnb Melbourne Dataset (airbnb) [1, 24, 67] consists of 22,895 listings, partitioned into 18,316 training samples and 4,579 testing samples.
Hardware Specification	Yes	All experiments were conducted on an AWS EC2 p4d.24xlarge instance equipped with 8 NVIDIA A100 (40GB) GPUs and 96 vCPUs.
Software Dependencies	Yes	Python version: 3.11 ... Tool Registry for Tabular Tasks { "name": "autogluon.tabular", "version": "1.2.0", ... Tool Registry for Multimodal Tasks { "name": "autogluon.multimodal", "version": "1.2.0", ... Tool Registry for Time Series Tasks { "name": "autogluon.timeseries", "version": "1.2.0", ... Tool Registry for Retrieval Tasks { "name": "FlagEmbedding", "version": "1.3.4"
Experiment Setup	Yes	Each agent was assigned a 3-hour time limit per dataset to produce results. By default, MLZero uses Claude 3.7 Sonnet as its underlying LLM. ... temperature 0 for planning and file reading tasks, while employing temperature 0.5 for coding. All agents utilize a 65,536-token context window with task-specific configurations. For the 8B configuration of MLZero, we adjusted parameters to accommodate the smaller context window. Specifically, we set the context window to 8,192 tokens across all agents, reduced retrieval size to 3, and limited maximum tutorial length to 4,069 characters.
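
The File Grouping procedure quoted in the Pseudocode row can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `group_files`, the default `delta=3`, and the use of `"*"` as the wildcard are hypothetical choices. The extracted pseudocode also does not reliably preserve its operators, so reading the comparison against δ as "at most δ distinct folders" is an assumption based on the accompanying comments ("use actual folder name" versus "use wildcard").

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_files(paths, delta=3):
    """Group file paths by folder pattern and extension (sketch of Algorithm 1).

    `delta` is a hypothetical default for the threshold written as δ in the
    pseudocode; the source does not state its value.
    """
    # First pass: record the distinct folder names seen at each depth.
    depth_folders = defaultdict(set)
    split = []
    for path in paths:
        parts = PurePosixPath(path).parts
        split.append((path, parts))
        for depth, folder in enumerate(parts[:-1]):  # last part is the file name
            depth_folders[depth].add(folder)

    # Second pass: build one pattern per file and bucket files by pattern.
    groups = defaultdict(list)
    for path, parts in split:
        pattern = []
        for depth, folder in enumerate(parts[:-1]):
            if len(depth_folders[depth]) <= delta:  # assumed comparison (<=)
                pattern.append(folder)  # few distinct folders: keep the name
            else:
                pattern.append("*")     # many distinct folders: use a wildcard
        pattern.append(PurePosixPath(path).suffix)  # file extension, e.g. ".jpg"
        groups[tuple(pattern)].append(path)
    return dict(groups)


if __name__ == "__main__":
    example = [
        "train/cat/1.jpg",
        "train/dog/2.jpg",
        "train/bird/3.jpg",
        "test/4.jpg",
        "labels.csv",
    ]
    for pattern, members in group_files(example, delta=2).items():
        print(pattern, members)
```

With `delta=2` in the usage example, the two top-level folders are kept by name, while the three class folders under `train` exceed the threshold and collapse into a single wildcard pattern. Patterns are stored as tuples so they can serve as dictionary keys.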