Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

MLZero: A Multi-Agent System for End-to-end Machine Learning Automation

Authors: Haoyang Fang, Boran Han, Nick Erickson, Xiyuan Zhang, Su Zhou, Anirudh Dagar, Jiani Zhang, Ali Caner Turkmen, Tony Hu, Huzefa Rangwala, Ying Nian Wu, Yuyang (Bernie) Wang, George Karypis

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental To evaluate the effectiveness of MLZero against state-of-the-art ML and coding agents, we conducted extensive experiments across multiple benchmarks and datasets. We first assess performance on MLE-bench Lite [10] with 21 diverse Kaggle competitions, then proceeded to the Multimodal AutoML Agent Benchmark for end-to-end evaluation across 25 diverse datasets spanning various modalities and ML tasks. We evaluate the performance using multiple metrics including success rate, average rank, relative time consumption, and solution quality. Additionally, we performed ablation studies to quantify the contribution of individual components within our proposed system. Finally, we conducted a detailed error analysis to identify and categorize failure cases across high-performance methods, providing insights into the robustness and limitations of each approach, followed by efficiency and robustness analysis examining token consumption, cost effectiveness, and robustness across different LLM backbones and under various noise conditions.
Researcher Affiliation Industry Haoyang Fang, Boran Han, Nick Erickson, Xiyuan Zhang, Su Zhou, Anirudh Dagar, Jiani Zhang, Ali Caner Turkmen, Cuixiong Hu, Huzefa Rangwala, Ying Nian Wu, Bernie Wang, George Karypis Amazon Web Services
Pseudocode Yes Algorithm 1 File Grouping
    procedure FILEGROUPING(files)
        depthFolders ← Map from depth to set of folders
        ▷ First pass: analyze folder structure
        for all file ∈ files do
            paths ← SplitPath(file.path)
            for depth ← 0 to |paths| − 2 do
                depthFolders[depth].add(paths[depth])
            end for
        end for
        ▷ Second pass: group files
        groups ← Map from pattern to file list
        for all file ∈ files do
            paths ← SplitPath(file.path)
            pattern ← []    ▷ Build pattern using folder structure
            for depth ← 0 to |paths| − 2 do
                if |depthFolders[depth]| ≤ δ then
                    pattern.append(paths[depth])    ▷ Use actual folder name
                else
                    pattern.append(*)    ▷ Use wildcard
                end if
            end for
            pattern.append(GetExtension(file.name))
            groups[pattern].append(file)
        end for
        return groups
    end procedure
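The two passes of Algorithm 1 can be illustrated with a short Python sketch. This is a reading of the pseudocode, not the authors' implementation: the function name `group_files`, the default `delta=3`, the `"*"` wildcard string, the `<=` comparison against δ, and the use of tuples as dictionary keys (Python lists are not hashable) are all assumptions made for the example.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_files(file_paths, delta=3):
    """Group file paths by folder pattern plus file extension.

    `delta` stands in for the threshold δ of Algorithm 1; its default
    value here is an assumption, not a value taken from the paper.
    """
    # First pass: collect the distinct folder names seen at each depth.
    depth_folders = defaultdict(set)
    for path in file_paths:
        folders = PurePosixPath(path).parts[:-1]  # drop the file name
        for depth, folder in enumerate(folders):
            depth_folders[depth].add(folder)

    # Second pass: build one pattern per file and bucket files by it.
    groups = defaultdict(list)
    for path in file_paths:
        pure = PurePosixPath(path)
        pattern = []
        for depth, folder in enumerate(pure.parts[:-1]):
            if len(depth_folders[depth]) <= delta:
                pattern.append(folder)  # few distinct names: keep the name
            else:
                pattern.append("*")     # many distinct names: wildcard
        pattern.append(pure.suffix)     # e.g. ".jpg"; "" if no extension
        groups[tuple(pattern)].append(path)
    return dict(groups)


example = group_files(
    [
        "train/cat/1.jpg",
        "train/cat/2.jpg",
        "train/dog/3.jpg",
        "train/bird/4.jpg",
        "train/labels.csv",
    ],
    delta=2,
)
```

With `delta=2`, depth 0 has one distinct folder (`train`) and keeps its name, while depth 1 has three (`cat`, `dog`, `bird`) and collapses to a wildcard, so the four images land in a single group keyed by `("train", "*", ".jpg")` and the CSV file forms its own group.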
Open Source Code Yes Website: https://project-mlzero.github.io/ GitHub: https://github.com/autogluon/autogluon-assistant ... Justification: Please check Appendix C for the implementation details, and Appendix B for the design and prompts for each agent. Our source codes are also attached in supplemental material.
Open Datasets Yes We constructed the Multimodal AutoML Agent Benchmark (MAAB) to address a critical gap in existing benchmarks: the ability to evaluate agents on raw, unprocessed multimodal data. To ensure fairness and diversity, all 25 datasets are sourced from reputable public repositories including Kaggle competitions, UCI Machine Learning Repository, and the BEIR benchmark suite. ... We open-source the complete benchmark including all datasets, evaluation scripts, and preprocessing specifications.
Dataset Splits Yes The RVL-CDIP (Ryerson Vision Lab Complex Document Information Processing) dataset, comprising 400,000 grayscale images in 16 classes, with 25,000 images per class. There are 320,000 training images, 40,000 validation images, and 40,000 test images. ... The Airbnb Melbourne Dataset (airbnb) [1, 24, 67] consists of 22,895 listings, partitioned into 18,316 training samples and 4,579 testing samples.
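As a quick consistency check on the split sizes quoted above, the following sketch recomputes the totals and split fractions. The dictionary and function names are illustrative only; the counts are the ones stated in the row.

```python
# Split sizes as quoted in the Dataset Splits row.
rvl_cdip = {"train": 320_000, "validation": 40_000, "test": 40_000}
airbnb = {"train": 18_316, "test": 4_579}


def split_summary(splits):
    """Return the total sample count and each split's share of it."""
    total = sum(splits.values())
    fractions = {name: count / total for name, count in splits.items()}
    return total, fractions


rvl_total, rvl_fractions = split_summary(rvl_cdip)
airbnb_total, airbnb_fractions = split_summary(airbnb)

# 16 classes with 25,000 images each should match the stated total.
rvl_from_classes = 16 * 25_000
```

The RVL-CDIP counts sum to 400,000, matching both the stated total and 16 classes of 25,000 images, which gives an 80/10/10 split. The Airbnb counts sum to 22,895, an 80/20 train/test split.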
Hardware Specification Yes All experiments were conducted on an AWS EC2 p4d.24xlarge instance equipped with 8 NVIDIA A100 (40GB) GPUs and 96 vCPUs.
Software Dependencies Yes Python version: 3.11 ... Tool Registry for Tabular Tasks { "name": "autogluon.tabular", "version": "1.2.0", ... Tool Registry for Multimodal Tasks { "name": "autogluon.multimodal", "version": "1.2.0", ... Tool Registry for Time Series Tasks { "name": "autogluon.timeseries", "version": "1.2.0", ... Tool Registry for Retrieval Tasks { "name": "FlagEmbedding", "version": "1.3.4"
Experiment Setup Yes Each agent was assigned a 3-hour time limit per dataset to produce results. By default, MLZero uses Claude 3.7 Sonnet as its underlying LLM. ... temperature 0 for planning and file reading tasks, while employing temperature 0.5 for coding. All agents utilize a 65536-token context window with task-specific configurations. For the 8B configuration of MLZero, we adjusted parameters to accommodate the smaller context window. Specifically, we set the context window to 8,192 tokens across all agents, reduced retrieval size to 3, and limited maximum tutorial length to 4,069 characters.
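The settings quoted above can be collected into a small configuration sketch. The class and field names are hypothetical and do not come from the MLZero code base. Values the row does not state are left as `None`: the default retrieval size, the default tutorial length, and the name of the 8B model. Carrying the temperatures and time limit over to the 8B configuration unchanged is likewise an assumption.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class AgentSettings:
    """Hypothetical container for the quoted experiment settings."""

    llm: Optional[str]                        # backbone model name
    time_limit_hours: float = 3.0             # per-dataset limit
    planning_temperature: float = 0.0         # planning and file reading
    coding_temperature: float = 0.5           # code generation
    context_window_tokens: int = 65_536
    retrieval_size: Optional[int] = None      # not stated for the default
    max_tutorial_chars: Optional[int] = None  # not stated for the default


# Default configuration as described in the row.
DEFAULT = AgentSettings(llm="Claude 3.7 Sonnet")

# 8B configuration: only the three adjusted values are stated in the row;
# the model name is not given there, so it is left unset.
SMALL_8B = replace(
    DEFAULT,
    llm=None,
    context_window_tokens=8_192,
    retrieval_size=3,
    max_tutorial_chars=4_069,
)
```

Using `dataclasses.replace` makes the difference between the two configurations explicit: the context window shrinks by a factor of eight, and the retrieval size and tutorial length are capped.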