Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
# On Joint Regularization and Calibration in Deep Ensembles
Authors: Laurits Fredsgaard, Mikkel N. Schmidt
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results demonstrate that jointly tuning the ensemble generally matches or improves performance, with significant variation in effect size across different tasks and metrics. We conduct a series of experiments to empirically measure the ensemble optimality gap across three key areas: hyperparameter tuning, temperature scaling, and early stopping. |
| Researcher Affiliation | Academia | Laurits Fredsgaard EMAIL Department of Applied Mathematics and Computer Science Technical University of Denmark Mikkel N. Schmidt EMAIL Department of Applied Mathematics and Computer Science Technical University of Denmark |
| Pseudocode | No | The paper describes methods and strategies using formal definitions and text, but does not contain explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at: https://github.com/lauritsf/ensemble-optimality-gap |
| Open Datasets | Yes | Image Classification (CIFAR-10 / WRN-16-4): Our first domain uses the CIFAR-10 dataset (Krizhevsky, 2009)... Graph Classification (NCI1 / GCN): As a contrasting setting with structured data, we used the NCI1 graph classification benchmark (Shervashidze et al., 2011; Wale et al., 2008)... Tabular Classification (Covertype / MLP): To cover a third modality and explore larger ensemble sizes, we include the Covertype dataset (Blackard & Dean, 1999) from the UCI repository. Text Classification (AG News / Bi-LSTM): Our final domain involves text classification using the AG News dataset (Zhang et al., 2015)... |
| Dataset Splits | Yes | CIFAR-10 consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images, and we use the original test set for final performance evaluation in all experiments. (A.2.1) We randomly split the dataset into a training set (80%) and a test set (20%) using stratified sampling to ensure class balance, resulting in 3288 training graphs and 822 test graphs. (A.2.2) |
| Hardware Specification | Yes | All experiments were conducted on the LUMI supercomputer, where each ensemble was trained on a single Graphics Compute Die (GCD) on a LUMI-G node, which is equipped with AMD MI250x GPUs. |
| Software Dependencies | No | The paper mentions optimizers (SGD, Adam) and algorithms (L-BFGS, cosine annealing learning rate schedule) and the GPT-2 tokenizer, but does not provide specific version numbers for any software dependencies or libraries used. |
| Experiment Setup | Yes | Across all experiments, the batch size was set to 128. The ensemble size was M = 4 for the WRN-16-4 and GCN models, and M = 8 for the MLP and Bi-LSTM models. For the weight decay tuning experiments, models were trained using SGD with a momentum of 0.9 and a cosine annealing learning rate schedule. We performed a grid search over a range of log-spaced weight decay values, always including 0. We used 100 epochs (200 for GCN), an initial learning rate of 0.1 (0.3 for MLP), and 5 random seeds (20 for GCN). |
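The Dataset Splits row quotes a stratified 80/20 train/test split (A.2.2). A minimal standard-library sketch of such a split; the function name, seed handling, and rounding rule are illustrative assumptions, not the authors' code:

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.2, seed=0):
    """Split indices into train/test, keeping per-class proportions.

    Illustrative sketch: groups indices by class, shuffles each group,
    and holds out `test_frac` of every class for the test set.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_frac)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)
```

With balanced classes this reproduces the 80/20 proportions per class, which is what stratified sampling guarantees.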
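The table notes that temperature scaling was fit with L-BFGS (Software Dependencies row). The toy sketch below instead fits a single temperature by grid search over the validation negative log-likelihood, as a stand-in for the authors' L-BFGS fit; the function names and grid range are illustrative assumptions:

```python
import math

def nll(logits, labels, T):
    # Average negative log-likelihood of the temperature-scaled softmax.
    total = 0.0
    for z, y in zip(logits, labels):
        scaled = [v / T for v in z]
        m = max(scaled)  # subtract the max for numerical stability
        logsum = m + math.log(sum(math.exp(v - m) for v in scaled))
        total += logsum - scaled[y]
    return total / len(logits)

def fit_temperature(logits, labels, grid=None):
    # The paper uses L-BFGS; a coarse grid search over T is a simple stand-in.
    grid = grid or [0.05 * k for k in range(1, 101)]  # T in (0, 5]
    return min(grid, key=lambda T: nll(logits, labels, T))
```

On overconfident predictions that are sometimes wrong, the fitted temperature exceeds 1, i.e. the softmax is softened, which is the behavior temperature scaling is meant to correct.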
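The Experiment Setup row fixes the batch size, momentum, ensemble sizes, epochs, learning rates, and seed counts, and describes a grid search over log-spaced weight decay values that always includes 0. The sketch below collects those reported values and builds such a grid; the grid bounds and size are assumptions, since the excerpt does not state them:

```python
import math

# Values reported in the paper's experiment setup.
BATCH_SIZE = 128
MOMENTUM = 0.9

SETUPS = {
    "WRN-16-4": {"ensemble_size": 4, "epochs": 100, "lr": 0.1, "seeds": 5},
    "GCN":      {"ensemble_size": 4, "epochs": 200, "lr": 0.1, "seeds": 20},
    "MLP":      {"ensemble_size": 8, "epochs": 100, "lr": 0.3, "seeds": 5},
    "Bi-LSTM":  {"ensemble_size": 8, "epochs": 100, "lr": 0.1, "seeds": 5},
}

def weight_decay_grid(low=1e-5, high=1e-2, num=7):
    """Log-spaced weight decay values with 0 prepended.

    The bounds and count are illustrative: the paper states only that
    log-spaced values were searched, always including 0.
    """
    lo, hi = math.log10(low), math.log10(high)
    step = (hi - lo) / (num - 1)
    return [0.0] + [10 ** (lo + i * step) for i in range(num)]
```

Including 0 in the grid lets the search recover plain unregularized SGD as a special case.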