Calibration tests in multi-class classification: A unifying framework
Authors: David Widmann, Fredrik Lindsten, Dave Zachariah
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose and evaluate empirically different consistent and unbiased estimators for a specific class of measures based on matrix-valued kernels. We conduct experiments to confirm the derived theoretical properties of the proposed calibration error estimators empirically and to compare them with a standard histogram-regression based estimator of the ECE, denoted by ÊCE. |
| Researcher Affiliation | Academia | David Widmann, Department of Information Technology, Uppsala University, Sweden (david.widmann@it.uu.se); Fredrik Lindsten, Division of Statistics and Machine Learning, Linköping University, Sweden (fredrik.lindsten@liu.se); Dave Zachariah, Department of Information Technology, Uppsala University, Sweden (dave.zachariah@it.uu.se) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | To facilitate multi-class calibration evaluation we provide the Julia packages ConsistencyResampling.jl (Widmann, 2019c), CalibrationErrors.jl (Widmann, 2019a), and CalibrationTests.jl (Widmann, 2019b) for consistency resampling, calibration error estimation, and calibration tests, respectively. The implementation of the experiments is available online at https://github.com/devmotion/CalibrationPaper. |
| Open Datasets | No | We construct synthetic data sets {(g(X_i), Y_i)}_{i=1}^n of 250 labeled predictions for m = 10 classes from three generative models. For each model we first sample predictions g(X_i) ~ Dir(0.1, ..., 0.1), and then simulate corresponding labels Y_i conditionally on g(X_i) from M1: Cat(g(X_i)); M2: 0.5 Cat(g(X_i)) + 0.5 Cat(1, 0, ..., 0); M3: Cat(0.1, ..., 0.1). |
| Dataset Splits | No | We construct synthetic data sets {(g(X_i), Y_i)}_{i=1}^n of 250 labeled predictions for m = 10 classes from three generative models. Consider the task of estimating the calibration error of model g using a validation set D = {(X_i, Y_i)}_{i=1}^n of n i.i.d. random pairs of inputs and labels that are distributed according to (X, Y). |
| Hardware Specification | No | The paper does not specify any hardware details such as specific GPU or CPU models, or computational resources used for the experiments. |
| Software Dependencies | Yes | To facilitate multi-class calibration evaluation we provide the Julia packages ConsistencyResampling.jl (Widmann, 2019c), CalibrationErrors.jl (Widmann, 2019a), and CalibrationTests.jl (Widmann, 2019b) for consistency resampling, calibration error estimation, and calibration tests, respectively. |
| Experiment Setup | Yes | For simplicity, we use the matrix-valued kernel k(x, y) = exp(-‖x - y‖/ν) I_10, where the kernel bandwidth ν > 0 is chosen by the median heuristic. For each model we first sample predictions g(X_i) ~ Dir(0.1, ..., 0.1), and then simulate corresponding labels Y_i conditionally on g(X_i) from M1: Cat(g(X_i)); M2: 0.5 Cat(g(X_i)) + 0.5 Cat(1, 0, ..., 0); M3: Cat(0.1, ..., 0.1). |
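The synthetic setup quoted above (250 Dirichlet-distributed predictions over 10 classes, with labels drawn from the three generative models M1–M3) can be sketched in NumPy. This is an illustrative reimplementation, not the authors' Julia code; the helper name `sample_labels` is our own.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 250, 10  # 250 labeled predictions, 10 classes

# Predictions g(X_i) ~ Dir(0.1, ..., 0.1)
preds = rng.dirichlet(np.full(m, 0.1), size=n)

def sample_labels(preds, model, rng):
    """Sample Y_i conditionally on g(X_i) under models M1-M3."""
    n, m = preds.shape
    labels = np.empty(n, dtype=int)
    for i, p in enumerate(preds):
        if model == "M1":    # calibrated: Y_i ~ Cat(g(X_i))
            probs = p
        elif model == "M2":  # mixture: 0.5 Cat(g(X_i)) + 0.5 Cat(1, 0, ..., 0)
            probs = p if rng.random() < 0.5 else np.eye(m)[0]
        else:                # M3: Cat(0.1, ..., 0.1), labels independent of g
            probs = np.full(m, 1.0 / m)
        labels[i] = rng.choice(m, p=probs)
    return labels

labels = {mdl: sample_labels(preds, mdl, rng) for mdl in ("M1", "M2", "M3")}
```

Under M1 the model is perfectly calibrated by construction, so any consistent calibration error estimator should concentrate near zero there, while M2 and M3 provide uncalibrated references.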
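The kernel choice and bandwidth selection in the setup row can likewise be sketched. Assuming the median heuristic means the median of pairwise Euclidean distances between predictions, and that the unbiased pairwise estimator of the squared kernel calibration error (SKCE) averages h_ij = exp(-‖g_i - g_j‖/ν) (e_{Y_i} - g_i)·(e_{Y_j} - g_j) over all pairs i < j (the identity-matrix kernel reduces the matrix-valued form to this scalar expression), a sketch looks like:

```python
import numpy as np

def median_heuristic(preds):
    """Bandwidth nu: median of pairwise Euclidean distances between predictions."""
    dists = [np.linalg.norm(p - q)
             for i, p in enumerate(preds) for q in preds[i + 1:]]
    return np.median(dists)

def skce_unbiased(preds, labels, nu):
    """Unbiased pairwise estimator of the squared kernel calibration error
    for k(x, y) = exp(-||x - y|| / nu) I_m (sketch, see lead-in assumptions)."""
    n, m = preds.shape
    resid = np.eye(m)[labels] - preds  # residuals e_{Y_i} - g(X_i)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            scalar = np.exp(-np.linalg.norm(preds[i] - preds[j]) / nu)
            total += scalar * (resid[i] @ resid[j])
    return total / (n * (n - 1) / 2)
```

Because the estimator is unbiased, its value can be negative on finite samples even for a calibrated model; its expectation is zero exactly when the model is calibrated.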