Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Auto-Sklearn 2.0: Hands-free AutoML via Meta-Learning
Authors: Matthias Feurer, Katharina Eggensperger, Stefan Falkner, Marius Lindauer, Frank Hutter
JMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We verify the improvements by these additions in an extensive experimental study on 39 AutoML benchmark datasets. |
| Researcher Affiliation | Collaboration | Matthias Feurer¹, Katharina Eggensperger¹, Stefan Falkner², Marius Lindauer³, Frank Hutter¹,². ¹Department of Computer Science, Albert-Ludwigs-Universität Freiburg; ²Bosch Center for Artificial Intelligence, Renningen, Germany; ³Institute of Information Processing, Leibniz University Hannover |
| Pseudocode | Yes | Appendix A. Additional pseudo-code We give pseudo-code for computing the estimated generalization error of P across all metadatasets Dmeta for K-fold cross-validation in Algorithm 2 and successive halving in Algorithm 3. Algorithm 2: Estimating the generalization error of a portfolio with K-Fold Cross Validation ... Algorithm 3: Estimating the generalization error of a portfolio with Successive Halving |
| Open Source Code | Yes | We provide scripts for reproducing all our experimental results at https://github.com/automl/ASKL2.0_experiments and provide a clean integration of our methods into the official Auto-sklearn repository. |
| Open Datasets | Yes | For Dtest, we rely on 39 datasets selected for the AutoML benchmark proposed by Gijsbers et al. (2019), which consists of datasets for comparing classifiers (Bischl et al., 2021) and datasets from the AutoML challenges (Guyon et al., 2019). We collected the meta datasets Dmeta based on OpenML (Vanschoren et al., 2014) using the OpenML-Python API (Feurer et al., 2021). |
| Dataset Splits | Yes | For all datasets, we use a single holdout test set of 33.33%, which is defined by the corresponding OpenML task. The remaining 66.66% are the training data of our AutoML systems, which handle further splits for model selection themselves based on the chosen model selection strategy. ... We used the pre-defined 1h8c setting, which divides each dataset into ten folds and gives each framework one hour on eight CPU cores to produce a final model. |
| Hardware Specification | Yes | All experiments were conducted on a compute cluster with machines equipped with 2 Intel Xeon Gold 6242 CPUs with 2.8GHz (32 cores) and 192 GB RAM, running Ubuntu 20.04.01. |
| Software Dependencies | Yes | We implemented the AutoML systems and experiments in the Python3 programming language, using numpy (Harris et al., 2020), scipy (Virtanen et al., 2020), scikit-learn (Pedregosa et al., 2011), pandas (Wes McKinney, 2010; Reback et al., 2021), and matplotlib (Hunter, 2007). We used version 0.12.6 of the Auto-sklearn Python package for the experiments and added Auto-sklearn 2.0 functionality in version 0.12.7, which we then used for the AutoML benchmark. ... Table 17: Package Versions: Auto-sklearn 2.0 0.12.7, Auto-sklearn 1.0 0.12.6, Auto-WEKA 2.6.3, TPOT 0.11.7, H2O AutoML 3.32.1.4, Tuned Random Forest 0.24.2, AutoML benchmark 973de79617e68a881dcc640842ea1d21dfd4b36c |
| Experiment Setup | Yes | We always report results averaged across 10 repetitions to account for randomness and report the mean and standard deviation over these repetitions. ... We conducted all experiments using ensemble selection, and we constructed ensembles of size 50 with replacement. ... We also limit the time and memory for each ML pipeline evaluation. For the time limit, we allow for at most 1/10 of the optimization budget, while for the memory, we allow the pipeline 4GB before forcefully terminating the execution. ... We used the same hyperparameters for all experiments. First, we set η = 4. Next, we had to choose the minimal and maximal budgets assigned to each algorithm. For the tree-based methods we chose to go from 32 to 512, while for the linear models (SGD and passive aggressive) we chose 64 as the minimal budget and 1024 as the maximal budget. ... Table 18: Configuration space for Auto-sklearn 2.0 using only iterative models and only preprocessing to transform data into a format that can be usefully employed by the different classification algorithms. The final column (log) states whether we actually search log10(λ). |
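The Experiment Setup row quotes the paper's use of ensemble selection with ensembles of size 50 built with replacement. This refers to the greedy forward-selection procedure of Caruana et al. (2004). The sketch below is not the Auto-sklearn implementation; it is a minimal illustration of the technique, assuming class-probability predictions on a validation set and classification error as the loss. The function name `ensemble_selection` and the array shapes are our own choices for illustration.

```python
import numpy as np

def ensemble_selection(val_probs, y_val, ensemble_size=50):
    """Greedy ensemble selection with replacement (Caruana et al., 2004).

    val_probs: array (n_models, n_samples, n_classes) of validation-set
               class probabilities, one slice per base model.
    y_val:     array (n_samples,) of true labels.
    Returns the list of chosen model indices (models may repeat).
    """
    n_models = val_probs.shape[0]
    chosen = []
    running = np.zeros_like(val_probs[0])  # sum of probs of chosen models
    for _ in range(ensemble_size):
        best_i, best_err = None, np.inf
        for i in range(n_models):
            # Error of the ensemble if model i were added next.
            avg = (running + val_probs[i]) / (len(chosen) + 1)
            err = np.mean(avg.argmax(axis=1) != y_val)
            if err < best_err:
                best_i, best_err = i, err
        chosen.append(best_i)
        running += val_probs[best_i]
    return chosen
```

Because selection is with replacement, a strong model can be picked many times, which effectively weights it more heavily in the final averaged ensemble.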
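The same row reports the successive-halving hyperparameters: η = 4, budgets 32 to 512 for tree-based methods and 64 to 1024 for the linear models. A minimal sketch of the resulting budget rungs, assuming budgets grow geometrically by a factor of η up to the maximum, as in standard successive halving; the helper name `successive_halving_budgets` is hypothetical:

```python
def successive_halving_budgets(b_min, b_max, eta=4):
    """Budget rungs from b_min to b_max, multiplying by eta each rung."""
    budgets = []
    b = b_min
    while b < b_max:
        budgets.append(b)
        b *= eta
    budgets.append(b_max)
    return budgets

# Tree-based methods (iterations of the learner per rung):
print(successive_halving_budgets(32, 512))    # [32, 128, 512]
# Linear models (SGD, passive aggressive):
print(successive_halving_budgets(64, 1024))   # [64, 256, 1024]
```

With η = 4, only the top quarter of configurations would survive each rung, so both families pass through exactly three budget levels.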