Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

On Joint Regularization and Calibration in Deep Ensembles

Authors: Laurits Fredsgaard, Mikkel N. Schmidt

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results demonstrate that jointly tuning the ensemble generally matches or improves performance, with significant variation in effect size across different tasks and metrics. We conduct a series of experiments to empirically measure the ensemble optimality gap across three key areas: hyperparameter tuning, temperature scaling, and early stopping.
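The excerpt above refers to deep ensembles and temperature scaling. As a minimal illustration of the underlying operation — not the authors' implementation — the sketch below averages the temperature-scaled softmax probabilities of M ensemble members for a single input; all function names are hypothetical:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of per-class logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_predict(member_logits, temperature=1.0):
    """Average the temperature-scaled probabilities of M members.

    member_logits: list of M lists, each holding one member's
    per-class logits for a single input.
    """
    probs = [softmax(logits, temperature) for logits in member_logits]
    n_classes = len(member_logits[0])
    m = len(member_logits)
    return [sum(p[c] for p in probs) / m for c in range(n_classes)]

# Example: two members, three classes; a higher temperature flattens
# each member's distribution before averaging.
ensemble = [[2.0, 0.5, -1.0], [1.5, 1.0, -0.5]]
print(ensemble_predict(ensemble, temperature=1.0))
```

Whether the temperature is tuned per member or once for the whole ensemble is exactly the kind of joint-versus-separate choice the paper's "ensemble optimality gap" experiments measure.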
Researcher Affiliation | Academia | Laurits Fredsgaard (EMAIL), Department of Applied Mathematics and Computer Science, Technical University of Denmark; Mikkel N. Schmidt (EMAIL), Department of Applied Mathematics and Computer Science, Technical University of Denmark
Pseudocode | No | The paper describes methods and strategies using formal definitions and text, but does not contain explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at: https://github.com/lauritsf/ensemble-optimality-gap
Open Datasets | Yes | Image Classification (CIFAR-10 / WRN-16-4): Our first domain uses the CIFAR-10 dataset (Krizhevsky, 2009)... Graph Classification (NCI1 / GCN): As a contrasting setting with structured data, we used the NCI1 graph classification benchmark (Shervashidze et al., 2011; Wale et al., 2008)... Tabular Classification (Covertype / MLP): To cover a third modality and explore larger ensemble sizes, we include the Covertype dataset (Blackard & Dean, 1999) from the UCI repository. Text Classification (AG News / BiLSTM): Our final domain involves text classification using the AG News dataset (Zhang et al., 2015)...
Dataset Splits | Yes | CIFAR-10 consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images, and we use the original test set for final performance evaluation in all experiments. (A.2.1) We randomly split the dataset into a training set (80%) and a test set (20%) using stratified sampling to ensure class balance, resulting in 3288 training graphs and 822 test graphs. (A.2.2)
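The NCI1 excerpt describes a stratified 80/20 split. A minimal pure-Python sketch of such a split — not the authors' code, and with a hypothetical function name — groups indices by label and takes the first 80% of each shuffled class for training, which preserves class balance by construction:

```python
import random
from collections import defaultdict

def stratified_split(labels, train_frac=0.8, seed=0):
    """Split sample indices into train/test, preserving class balance.

    Groups indices by label, shuffles each group, and sends the first
    train_frac of every class to the training set.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = round(train_frac * len(idxs))
        train.extend(idxs[:cut])
        test.extend(idxs[cut:])
    return sorted(train), sorted(test)

# Example: a balanced binary problem splits 80/20 within each class.
labels = [0] * 100 + [1] * 100
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 160 40
```

In practice the same effect is obtained with `sklearn.model_selection.train_test_split(..., stratify=labels)`.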
Hardware Specification | Yes | All experiments were conducted on the LUMI supercomputer, where each ensemble was trained on a single Graphics Compute Die (GCD) on a LUMI-G node, which is equipped with AMD MI250x GPUs.
Software Dependencies | No | The paper mentions optimizers (SGD, Adam) and algorithms (L-BFGS, cosine annealing learning rate schedule) and the GPT-2 tokenizer, but does not provide specific version numbers for any software dependencies or libraries used.
Experiment Setup | Yes | Across all experiments, the batch size was set to 128. The ensemble size was M = 4 for the WRN-16-4 and GCN models, and M = 8 for the MLP and BiLSTM models. For the weight decay tuning experiments, models were trained using SGD with a momentum of 0.9 and a cosine annealing learning rate schedule. We performed a grid search over a range of log-spaced weight decay values, always including 0. We used 100 epochs (200 for GCN), an initial learning rate of 0.1 (0.3 for MLP), and 5 random seeds (20 for GCN).
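The setup above names two concrete recipes: a log-spaced weight-decay grid that always includes 0, and a cosine annealing learning-rate schedule. The sketch below illustrates both with plain Python; the grid bounds and number of points are assumptions for illustration (the excerpt does not state them), while the cosine formula is the standard no-restart schedule:

```python
import math

def weight_decay_grid(low_exp=-5, high_exp=-2, num=7):
    """Log-spaced weight-decay candidates, always including 0.

    Bounds 10**low_exp .. 10**high_exp are hypothetical; the paper only
    says the grid is log-spaced and includes 0.
    """
    step = (high_exp - low_exp) / (num - 1)
    return [0.0] + [10 ** (low_exp + i * step) for i in range(num)]

def cosine_annealing_lr(epoch, total_epochs, lr_init, lr_min=0.0):
    """Cosine-annealed learning rate at a given epoch (no restarts)."""
    return lr_min + 0.5 * (lr_init - lr_min) * (
        1 + math.cos(math.pi * epoch / total_epochs)
    )

# Example: the CIFAR-10 / WRN-16-4 setting (100 epochs, initial lr 0.1).
print(cosine_annealing_lr(0, 100, 0.1))             # 0.1 at the start
print(round(cosine_annealing_lr(50, 100, 0.1), 3))  # 0.05 at the midpoint
```

In a PyTorch training loop the schedule would typically come from `torch.optim.lr_scheduler.CosineAnnealingLR` rather than being hand-rolled.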