Calibration tests in multi-class classification: A unifying framework
Authors: David Widmann, Fredrik Lindsten, Dave Zachariah
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose and evaluate empirically different consistent and unbiased estimators for a specific class of measures based on matrix-valued kernels. We conduct experiments to confirm the derived theoretical properties of the proposed calibration error estimators empirically and to compare them with a standard histogram-regression based estimator of the ECE, denoted by ÊCE. |
| Researcher Affiliation | Academia | David Widmann, Department of Information Technology, Uppsala University, Sweden (david.widmann@it.uu.se); Fredrik Lindsten, Division of Statistics and Machine Learning, Linköping University, Sweden (fredrik.lindsten@liu.se); Dave Zachariah, Department of Information Technology, Uppsala University, Sweden (dave.zachariah@it.uu.se) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | To facilitate multi-class calibration evaluation we provide the Julia packages ConsistencyResampling.jl (Widmann, 2019c), CalibrationErrors.jl (Widmann, 2019a), and CalibrationTests.jl (Widmann, 2019b) for consistency resampling, calibration error estimation, and calibration tests, respectively. The implementation of the experiments is available online at https://github.com/devmotion/CalibrationPaper. |
| Open Datasets | No | We construct synthetic data sets {(g(X_i), Y_i)}_{i=1}^n of 250 labeled predictions for m = 10 classes from three generative models. For each model we first sample predictions g(X_i) ~ Dir(0.1, ..., 0.1), and then simulate corresponding labels Y_i conditionally on g(X_i) from M1: Cat(g(X_i)); M2: 0.5 Cat(g(X_i)) + 0.5 Cat(1, 0, ..., 0); M3: Cat(0.1, ..., 0.1). |
| Dataset Splits | No | We construct synthetic data sets {(g(X_i), Y_i)}_{i=1}^n of 250 labeled predictions for m = 10 classes from three generative models. Consider the task of estimating the calibration error of model g using a validation set D = {(X_i, Y_i)}_{i=1}^n of n i.i.d. random pairs of inputs and labels that are distributed according to (X, Y). |
| Hardware Specification | No | The paper does not specify any hardware details such as specific GPU or CPU models, or computational resources used for the experiments. |
| Software Dependencies | Yes | To facilitate multi-class calibration evaluation we provide the Julia packages ConsistencyResampling.jl (Widmann, 2019c), CalibrationErrors.jl (Widmann, 2019a), and CalibrationTests.jl (Widmann, 2019b) for consistency resampling, calibration error estimation, and calibration tests, respectively. |
| Experiment Setup | Yes | For simplicity, we use the matrix-valued kernel k(x, y) = exp(-‖x - y‖/ν) I_10, where the kernel bandwidth ν > 0 is chosen by the median heuristic. For each model we first sample predictions g(X_i) ~ Dir(0.1, ..., 0.1), and then simulate corresponding labels Y_i conditionally on g(X_i) from M1: Cat(g(X_i)); M2: 0.5 Cat(g(X_i)) + 0.5 Cat(1, 0, ..., 0); M3: Cat(0.1, ..., 0.1). |
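The synthetic setup quoted above (250 Dirichlet-distributed predictions over 10 classes, with labels drawn from the three generative models M1–M3) can be sketched in NumPy. This is an illustrative reimplementation, not the authors' Julia code; the helper name `sample_labels` is our own.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 250, 10  # 250 labeled predictions, 10 classes

# Predictions g(X_i) ~ Dir(0.1, ..., 0.1)
preds = rng.dirichlet(np.full(m, 0.1), size=n)

def sample_labels(preds, model, rng):
    """Sample Y_i conditionally on g(X_i) under models M1-M3."""
    n, m = preds.shape
    labels = np.empty(n, dtype=int)
    for i, p in enumerate(preds):
        if model == "M1":    # calibrated: Y_i ~ Cat(g(X_i))
            probs = p
        elif model == "M2":  # mixture: 0.5 Cat(g(X_i)) + 0.5 Cat(1, 0, ..., 0)
            probs = p if rng.random() < 0.5 else np.eye(m)[0]
        else:                # M3: Cat(0.1, ..., 0.1), labels independent of g
            probs = np.full(m, 1.0 / m)
        labels[i] = rng.choice(m, p=probs)
    return labels

labels = {mdl: sample_labels(preds, mdl, rng) for mdl in ("M1", "M2", "M3")}
```

Under M1 the model is perfectly calibrated by construction, so any consistent calibration error estimator should concentrate near zero there, while M2 and M3 provide uncalibrated references.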
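The kernel choice and bandwidth selection in the setup row can likewise be sketched. Assuming the median heuristic means the median of pairwise Euclidean distances between predictions, and that the unbiased pairwise estimator of the squared kernel calibration error (SKCE) averages h_ij = exp(-‖g_i - g_j‖/ν) (e_{Y_i} - g_i)·(e_{Y_j} - g_j) over all pairs i < j (the identity-matrix kernel reduces the matrix-valued form to this scalar expression), a sketch looks like:

```python
import numpy as np

def median_heuristic(preds):
    """Bandwidth nu: median of pairwise Euclidean distances between predictions."""
    dists = [np.linalg.norm(p - q)
             for i, p in enumerate(preds) for q in preds[i + 1:]]
    return np.median(dists)

def skce_unbiased(preds, labels, nu):
    """Unbiased pairwise estimator of the squared kernel calibration error
    for k(x, y) = exp(-||x - y|| / nu) I_m (sketch, see lead-in assumptions)."""
    n, m = preds.shape
    resid = np.eye(m)[labels] - preds  # residuals e_{Y_i} - g(X_i)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            scalar = np.exp(-np.linalg.norm(preds[i] - preds[j]) / nu)
            total += scalar * (resid[i] @ resid[j])
    return total / (n * (n - 1) / 2)
```

Because the estimator is unbiased, its value can be negative on finite samples even for a calibrated model; its expectation is zero exactly when the model is calibrated.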