Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Robust Calibration with Multi-domain Temperature Scaling
Authors: Yaodong Yu, Stephen Bates, Yi Ma, Michael Jordan
NeurIPS 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through experiments on three benchmark data sets, we find our proposed method outperforms existing methods as measured on both in-distribution and out-of-distribution test sets. |
| Researcher Affiliation | Academia | Yaodong Yu University of California, Berkeley Stephen Bates University of California, Berkeley Yi Ma University of California, Berkeley Michael I. Jordan University of California, Berkeley |
| Pseudocode | Yes | A presentation of the algorithm in pseudocode can be found in Algorithm 1, Appendix A. |
| Open Source Code | Yes | Our code is available at https://github.com/yaodongyu/MDTS. |
| Open Datasets | Yes | We evaluate different calibration methods on three datasets, ImageNet-C [Hendrycks and Dietterich, 2019], WILDS-RxRx1 [Koh et al., 2021], and GLDv2 [Weyand et al., 2020]. |
| Dataset Splits | Yes | For every domain k, we learn temperature $\hat{T}_k$ by applying temperature scaling on validation data $D_k = \{(x_{i,k}, y_{i,k})\}_{i=1}^{n_k}$ from k-th domain... For all datasets, we randomly sample half of the data from in-distribution domains for calibrating models and use the remaining samples for InD ECE evaluation. |
| Hardware Specification | Yes | All experiments are conducted on an NVIDIA A100 GPU. |
| Software Dependencies | No | The implementations are mainly based on scikit-learn [Pedregosa et al., 2011]. However, no specific version number for scikit-learn is provided. |
| Experiment Setup | Yes | We apply SGD optimizer to training the models on training datasets. We set the bin size as 100 for ImageNet-C, and set bin size as 20 for WILDS-RxRx1 and GLDv2. We use grid search (on InD domains) to select hyperparameters for Ridge, Huber, KRR, and KNN. |
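
The Dataset Splits and Experiment Setup rows describe three steps: split each in-distribution domain in half at random, fit a temperature on one half, and measure ECE with a fixed number of bins on the other half. The sketch below illustrates those steps for a single domain on synthetic logits. It is not the authors' code and does not reproduce their multi-domain method. All function names (`fit_temperature`, `expected_calibration_error`, `split_half`, `demo`) are hypothetical, the temperature is fitted by a simple grid search over the negative log-likelihood (an assumption; the paper's repository may use a different optimizer), and the ECE uses equal-width confidence bins, which is one common definition.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with max subtraction for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def nll(logits, labels, temperature):
    """Mean negative log-likelihood of the labels under logits / temperature."""
    probs = softmax(logits / temperature)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))


def fit_temperature(logits, labels, grid=None):
    """Pick the temperature on a grid that minimizes the NLL (assumed optimizer)."""
    if grid is None:
        grid = np.linspace(0.05, 10.0, 200)
    losses = [nll(logits, labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])


def expected_calibration_error(probs, labels, n_bins=20):
    """ECE with equal-width confidence bins: sum of bin_weight * |acc - conf|."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(ece)


def split_half(n, rng):
    """Randomly split indices 0..n-1 into a calibration half and an evaluation half."""
    perm = rng.permutation(n)
    return perm[: n // 2], perm[n // 2:]


def demo(seed=0, n=4000, n_classes=5, n_bins=20):
    """Synthetic single-domain example with deliberately overconfident logits."""
    rng = np.random.default_rng(seed)
    true_logits = 2.0 * rng.normal(size=(n, n_classes))
    # Gumbel-max trick: sample labels from softmax(true_logits).
    labels = np.argmax(true_logits + rng.gumbel(size=true_logits.shape), axis=1)
    logits = 3.0 * true_logits  # the model is overconfident by a factor of 3

    calib_idx, eval_idx = split_half(n, rng)
    temperature = fit_temperature(logits[calib_idx], labels[calib_idx])

    ece_before = expected_calibration_error(
        softmax(logits[eval_idx]), labels[eval_idx], n_bins)
    ece_after = expected_calibration_error(
        softmax(logits[eval_idx] / temperature), labels[eval_idx], n_bins)
    return temperature, ece_before, ece_after


if __name__ == "__main__":
    t, before, after = demo()
    print(f"fitted temperature: {t:.2f}")
    print(f"ECE before: {before:.4f}, after: {after:.4f}")
```

In a multi-domain setting, the same `fit_temperature` call would presumably be repeated once per domain on that domain's calibration half; how the per-domain temperatures are then combined is specific to the paper and is not sketched here.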