Calibration tests in multi-class classification: A unifying framework

Authors: David Widmann, Fredrik Lindsten, Dave Zachariah

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose and evaluate empirically different consistent and unbiased estimators for a specific class of measures based on matrix-valued kernels. We conduct experiments to confirm the derived theoretical properties of the proposed calibration error estimators empirically and to compare them with a standard histogram-regression based estimator of the ECE, denoted by ÊCE.
Researcher Affiliation | Academia | David Widmann, Department of Information Technology, Uppsala University, Sweden (david.widmann@it.uu.se); Fredrik Lindsten, Division of Statistics and Machine Learning, Linköping University, Sweden (fredrik.lindsten@liu.se); Dave Zachariah, Department of Information Technology, Uppsala University, Sweden (dave.zachariah@it.uu.se)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | To facilitate multi-class calibration evaluation we provide the Julia packages ConsistencyResampling.jl (Widmann, 2019c), CalibrationErrors.jl (Widmann, 2019a), and CalibrationTests.jl (Widmann, 2019b) for consistency resampling, calibration error estimation, and calibration tests, respectively. The implementation of the experiments is available online at https://github.com/devmotion/CalibrationPaper.
Open Datasets | No | We construct synthetic data sets {(g(X_i), Y_i)}_{i=1}^n of 250 labeled predictions for m = 10 classes from three generative models. For each model we first sample predictions g(X_i) ~ Dir(0.1, ..., 0.1), and then simulate corresponding labels Y_i conditionally on g(X_i) from M1: Cat(g(X_i)), M2: 0.5 Cat(g(X_i)) + 0.5 Cat(1, 0, ..., 0), M3: Cat(0.1, ..., 0.1).
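The quoted data-generating process can be sketched in code. This is an illustrative NumPy version, not the paper's implementation (which is in Julia); all function names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 250, 10  # 250 labeled predictions, 10 classes

# Predictions g(X_i) ~ Dir(0.1, ..., 0.1)
preds = rng.dirichlet(np.full(m, 0.1), size=n)

def sample_labels(preds, model, rng):
    """Simulate labels Y_i conditionally on g(X_i) under models M1-M3."""
    n, m = preds.shape
    labels = np.empty(n, dtype=int)
    for i, p in enumerate(preds):
        if model == "M1":    # calibrated model: Y_i ~ Cat(g(X_i))
            labels[i] = rng.choice(m, p=p)
        elif model == "M2":  # mixture: 0.5 Cat(g(X_i)) + 0.5 Cat(1, 0, ..., 0)
            labels[i] = rng.choice(m, p=p) if rng.random() < 0.5 else 0
        elif model == "M3":  # labels independent of predictions: Cat(0.1, ..., 0.1)
            labels[i] = rng.choice(m, p=np.full(m, 0.1))
    return labels

labels_m1 = sample_labels(preds, "M1", rng)
```

Under M1 the model is calibrated by construction, which is what makes these three models useful as ground truth for evaluating calibration error estimators.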
Dataset Splits | No | We construct synthetic data sets {(g(X_i), Y_i)}_{i=1}^n of 250 labeled predictions for m = 10 classes from three generative models. Consider the task of estimating the calibration error of model g using a validation set D = {(X_i, Y_i)}_{i=1}^n of n i.i.d. random pairs of inputs and labels that are distributed according to (X, Y).
Hardware Specification | No | The paper does not specify any hardware details such as specific GPU or CPU models, or computational resources used for the experiments.
Software Dependencies | Yes | To facilitate multi-class calibration evaluation we provide the Julia packages ConsistencyResampling.jl (Widmann, 2019c), CalibrationErrors.jl (Widmann, 2019a), and CalibrationTests.jl (Widmann, 2019b) for consistency resampling, calibration error estimation, and calibration tests, respectively.
Experiment Setup | Yes | For simplicity, we use the matrix-valued kernel k(x, y) = exp(-||x - y||/ν) I_10, where the kernel bandwidth ν > 0 is chosen by the median heuristic. For each model we first sample predictions g(X_i) ~ Dir(0.1, ..., 0.1), and then simulate corresponding labels Y_i conditionally on g(X_i) from M1: Cat(g(X_i)), M2: 0.5 Cat(g(X_i)) + 0.5 Cat(1, 0, ..., 0), M3: Cat(0.1, ..., 0.1).
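The kernel choice in the setup quote can be sketched as follows. This is a hypothetical NumPy illustration (the paper's code is Julia, in CalibrationErrors.jl); it assumes the common reading of the median heuristic as the median of pairwise Euclidean distances between the predictions.

```python
import numpy as np

def median_heuristic(preds):
    """Bandwidth nu: median of pairwise Euclidean distances between predictions."""
    diffs = preds[:, None, :] - preds[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    iu = np.triu_indices_from(dists, k=1)  # distinct pairs only
    return np.median(dists[iu])

def matrix_kernel(x, y, nu, m=10):
    """Matrix-valued kernel k(x, y) = exp(-||x - y||/nu) * I_m."""
    return np.exp(-np.linalg.norm(x - y) / nu) * np.eye(m)

# Example usage on Dirichlet-distributed predictions as in the experiments
rng = np.random.default_rng(0)
preds = rng.dirichlet(np.full(10, 0.1), size=250)
nu = median_heuristic(preds)
K = matrix_kernel(preds[0], preds[1], nu)  # a 10x10 matrix
```

Because the kernel is a scaled identity matrix, each of the 10 output dimensions is treated identically and independently; the scalar factor is a Laplacian kernel in the prediction vectors.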