A Consistent and Differentiable Lp Canonical Calibration Error Estimator

Authors: Teodora Popordanoska, Raphael Sayer, Matthew Blaschko

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results validate the correctness of our estimator, and demonstrate its utility in canonical calibration error estimation and calibration error regularized risk minimization.
Researcher Affiliation | Academia | Teodora Popordanoska (ESAT-PSI, KU Leuven, teodora.popordanoska@kuleuven.be); Raphael Sayer (University of Tübingen, raphael.sayer@uni-tuebingen.de); Matthew B. Blaschko (ESAT-PSI, KU Leuven, matthew.blaschko@esat.kuleuven.be)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/tpopordanoska/ece-kde. (A hedged sketch of the KDE-based estimator follows this table.)
Open Datasets | Yes | The Kather dataset [Kather et al., 2016] consists of 5000 histological images of human colorectal cancer and it has eight different classes of tissue. DermaMNIST [Yang et al., 2021] is a pre-processed version of the HAM10000 dataset [Tschandl et al., 2018], containing 10015 dermatoscopic images of skin lesions, categorized in seven classes. Both datasets have been collected in accordance with the Declaration of Helsinki. According to standard practice in related works, we trained ResNet [He et al., 2016], ResNet with stochastic depth (SD) [Huang et al., 2016], DenseNet [Huang et al., 2017] and WideResNet [Zagoruyko and Komodakis, 2016] networks also on CIFAR-10/100 [Krizhevsky, 2009].
Dataset Splits | Yes | We use 45000 images for training on the CIFAR datasets, 4000 for Kather and 7007 for DermaMNIST. ... The λ parameter for weighting the calibration error w.r.t. the loss is typically chosen via cross-validation or using a holdout validation set.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, or memory specifications) used for running the experiments. It only lists training times.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies or libraries used in the experiments.
Experiment Setup | Yes | For our experiments we choose the bandwidth from a list of possible values by maximizing the leave-one-out likelihood (LOO MLE). The λ parameter for weighting the calibration error w.r.t. the loss is typically chosen via cross-validation or using a holdout validation set. We found that for KDE-XE, values of λ ∈ [0.001, 0.2] provide a good trade-off in terms of accuracy and calibration error. The p parameter is selected depending on the desired Lp calibration error and the corresponding theoretical guarantees. The rest of the hyperparameters for training are set as proposed in the corresponding papers for the architectures we benchmark. In particular, for the CIFAR-10/100 datasets we used a batch size of 64 for DenseNet and 128 for the other architectures. For the medical datasets, we used a batch size of 64, due to their smaller size. (Sketches of the LOO MLE bandwidth selection and the KDE-XE objective follow this table.)
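
For readers without access to the repository, the following is a minimal sketch of a leave-one-out kernel-regression estimator of the Lp canonical calibration error, in the spirit of the paper's ECE-KDE. It assumes a Gaussian kernel (the reference implementation uses a Dirichlet kernel on the probability simplex), and the function name `ece_kde_lp` is ours, not the authors'; see https://github.com/tpopordanoska/ece-kde for their code.

```python
# Sketch of an Lp canonical calibration error estimator via leave-one-out
# kernel regression, in the spirit of ECE-KDE. The Gaussian kernel is an
# illustrative substitute for the Dirichlet kernel used in the reference
# implementation; `ece_kde_lp` is a hypothetical name.
import torch
import torch.nn.functional as F

def ece_kde_lp(probs: torch.Tensor, labels: torch.Tensor,
               bandwidth: float, p: int = 2) -> torch.Tensor:
    """probs: (n, k) predicted probability vectors; labels: (n,) class ids."""
    n, k = probs.shape
    y = F.one_hot(labels, k).float()                             # (n, k)
    sq_dists = torch.cdist(probs, probs) ** 2                    # (n, n)
    mask = 1.0 - torch.eye(n, device=probs.device)               # leave-one-out
    weights = torch.exp(-sq_dists / (2 * bandwidth ** 2)) * mask
    # Nadaraya-Watson estimate of the canonical calibration map E[y | f(x_i)].
    cond_mean = weights @ y / weights.sum(dim=1, keepdim=True)   # (n, k)
    # Mean Lp^p distance between the estimated conditional mean and f(x_i).
    return ((cond_mean - probs).abs() ** p).sum(dim=1).mean()
```

Every operation here is differentiable in `probs`, which is what allows an estimator of this form to be used as a training-time penalty rather than only as a post-hoc diagnostic.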
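
The bandwidth selection described in the setup row ("choose the bandwidth from a list of possible values by maximizing the leave-one-out likelihood") can be sketched as below. The Gaussian kernel and the candidate grid in the usage comment are assumptions for illustration; the paper does not specify them here.

```python
# Sketch of leave-one-out maximum-likelihood (LOO MLE) bandwidth selection
# over a candidate grid, assuming a Gaussian KDE over the predictions.
import math
import torch

def loo_mle_bandwidth(probs: torch.Tensor, candidates) -> float:
    """Return the candidate bandwidth maximizing the LOO log-likelihood."""
    n, d = probs.shape
    sq_dists = torch.cdist(probs, probs) ** 2
    mask = 1.0 - torch.eye(n, device=probs.device)   # exclude each point itself
    best_h, best_ll = None, -math.inf
    for h in candidates:
        # Leave-one-out Gaussian KDE evaluated at each sample.
        kern = torch.exp(-sq_dists / (2 * h ** 2)) * mask
        density = kern.sum(dim=1) / ((n - 1) * (2 * math.pi * h ** 2) ** (d / 2))
        ll = density.log().sum().item()
        if ll > best_ll:
            best_ll, best_h = ll, h
    return best_h

# Illustrative usage with a hypothetical candidate grid:
# h = loo_mle_bandwidth(probs, [0.001, 0.005, 0.01, 0.05, 0.1, 0.5])
```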
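
Finally, the KDE-XE objective referenced in the setup row combines cross-entropy with the λ-weighted calibration estimate. A minimal sketch, reusing the `ece_kde_lp` function above (the name `kde_xe_loss` and the variable names are ours):

```python
import torch.nn.functional as F

def kde_xe_loss(logits, labels, lam: float, bandwidth: float):
    """Cross-entropy plus a lambda-weighted calibration penalty (KDE-XE sketch).

    Per the setup row, lam in [0.001, 0.2] gave a good accuracy/calibration
    trade-off in the paper's experiments.
    """
    probs = logits.softmax(dim=1)
    return F.cross_entropy(logits, labels) + lam * ece_kde_lp(probs, labels, bandwidth, p=2)
```

Because the penalty is differentiable, this loss can be backpropagated end-to-end like a standard training objective.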