Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

On Joint Regularization and Calibration in Deep Ensembles

Authors: Laurits Fredsgaard, Mikkel N. Schmidt

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results demonstrate that jointly tuning the ensemble generally matches or improves performance, with significant variation in effect size across different tasks and metrics. We conduct a series of experiments to empirically measure the ensemble optimality gap across three key areas: hyperparameter tuning, temperature scaling, and early stopping.
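The excerpt above refers to deep ensembles and temperature scaling. As a minimal illustration of the underlying operation — not the authors' implementation — the sketch below averages the temperature-scaled softmax probabilities of M ensemble members for a single input; all function names are hypothetical:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of per-class logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_predict(member_logits, temperature=1.0):
    """Average the temperature-scaled probabilities of M members.

    member_logits: list of M lists, each holding one member's
    per-class logits for a single input.
    """
    probs = [softmax(logits, temperature) for logits in member_logits]
    n_classes = len(member_logits[0])
    m = len(member_logits)
    return [sum(p[c] for p in probs) / m for c in range(n_classes)]

# Example: two members, three classes; a higher temperature flattens
# each member's distribution before averaging.
ensemble = [[2.0, 0.5, -1.0], [1.5, 1.0, -0.5]]
print(ensemble_predict(ensemble, temperature=1.0))
```

Whether the temperature is tuned per member or once for the whole ensemble is exactly the kind of joint-versus-separate choice the paper's "ensemble optimality gap" experiments measure.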
Researcher Affiliation | Academia | Laurits Fredsgaard (EMAIL), Department of Applied Mathematics and Computer Science, Technical University of Denmark; Mikkel N. Schmidt (EMAIL), Department of Applied Mathematics and Computer Science, Technical University of Denmark
Pseudocode | No | The paper describes methods and strategies using formal definitions and text, but does not contain explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at: https://github.com/lauritsf/ensemble-optimality-gap
Open Datasets | Yes | Image Classification (CIFAR-10 / WRN-16-4): Our first domain uses the CIFAR-10 dataset (Krizhevsky, 2009)... Graph Classification (NCI1 / GCN): As a contrasting setting with structured data, we used the NCI1 graph classification benchmark (Shervashidze et al., 2011; Wale et al., 2008)... Tabular Classification (Covertype / MLP): To cover a third modality and explore larger ensemble sizes, we include the Covertype dataset (Blackard & Dean, 1999) from the UCI repository. Text Classification (AG News / BiLSTM): Our final domain involves text classification using the AG News dataset (Zhang et al., 2015)...
Dataset Splits | Yes | CIFAR-10 consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images, and we use the original test set for final performance evaluation in all experiments. (A.2.1) We randomly split the dataset into a training set (80%) and a test set (20%) using stratified sampling to ensure class balance, resulting in 3288 training graphs and 822 test graphs. (A.2.2)
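The NCI1 excerpt describes a stratified 80/20 split. A minimal pure-Python sketch of such a split — not the authors' code, and with a hypothetical function name — groups indices by label and takes the first 80% of each shuffled class for training, which preserves class balance by construction:

```python
import random
from collections import defaultdict

def stratified_split(labels, train_frac=0.8, seed=0):
    """Split sample indices into train/test, preserving class balance.

    Groups indices by label, shuffles each group, and sends the first
    train_frac of every class to the training set.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = round(train_frac * len(idxs))
        train.extend(idxs[:cut])
        test.extend(idxs[cut:])
    return sorted(train), sorted(test)

# Example: a balanced binary problem splits 80/20 within each class.
labels = [0] * 100 + [1] * 100
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 160 40
```

In practice the same effect is obtained with `sklearn.model_selection.train_test_split(..., stratify=labels)`.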
Hardware Specification | Yes | All experiments were conducted on the LUMI supercomputer, where each ensemble was trained on a single Graphics Compute Die (GCD) on a LUMI-G node, which is equipped with AMD MI250x GPUs.
Software Dependencies | No | The paper mentions optimizers (SGD, Adam) and algorithms (L-BFGS, cosine annealing learning rate schedule) and the GPT-2 tokenizer, but does not provide specific version numbers for any software dependencies or libraries used.
Experiment Setup | Yes | Across all experiments, the batch size was set to 128. The ensemble size was M = 4 for the WRN-16-4 and GCN models, and M = 8 for the MLP and BiLSTM models. For the weight decay tuning experiments, models were trained using SGD with a momentum of 0.9 and a cosine annealing learning rate schedule. We performed a grid search over a range of log-spaced weight decay values, always including 0. We used 100 epochs (200 for GCN), an initial learning rate of 0.1 (0.3 for MLP), and 5 random seeds (20 for GCN).
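The setup above names two concrete recipes: a log-spaced weight-decay grid that always includes 0, and a cosine annealing learning-rate schedule. The sketch below illustrates both with plain Python; the grid bounds and number of points are assumptions for illustration (the excerpt does not state them), while the cosine formula is the standard no-restart schedule:

```python
import math

def weight_decay_grid(low_exp=-5, high_exp=-2, num=7):
    """Log-spaced weight-decay candidates, always including 0.

    Bounds 10**low_exp .. 10**high_exp are hypothetical; the paper only
    says the grid is log-spaced and includes 0.
    """
    step = (high_exp - low_exp) / (num - 1)
    return [0.0] + [10 ** (low_exp + i * step) for i in range(num)]

def cosine_annealing_lr(epoch, total_epochs, lr_init, lr_min=0.0):
    """Cosine-annealed learning rate at a given epoch (no restarts)."""
    return lr_min + 0.5 * (lr_init - lr_min) * (
        1 + math.cos(math.pi * epoch / total_epochs)
    )

# Example: the CIFAR-10 / WRN-16-4 setting (100 epochs, initial lr 0.1).
print(cosine_annealing_lr(0, 100, 0.1))             # 0.1 at the start
print(round(cosine_annealing_lr(50, 100, 0.1), 3))  # 0.05 at the midpoint
```

In a PyTorch training loop the schedule would typically come from `torch.optim.lr_scheduler.CosineAnnealingLR` rather than being hand-rolled.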