Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration

Authors: Meelis Kull, Miquel Perello Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, Peter Flach

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate improved probabilistic predictions according to multiple measures (confidence-ECE, classwise-ECE, log-loss, Brier score) across a wide range of datasets and classifiers.
Researcher Affiliation | Academia | Meelis Kull, Department of Computer Science, University of Tartu, meelis.kull@ut.ee; Miquel Perello-Nieto, Department of Computer Science, University of Bristol, miquel.perellonieto@bris.ac.uk; Markus Kängsepp, Department of Computer Science, University of Tartu, markus.kangsepp@ut.ee; Telmo Silva Filho, Department of Statistics, Universidade Federal da Paraíba, telmo@de.ufpb.br; Hao Song, Department of Computer Science, University of Bristol, hao.song@bristol.ac.uk; Peter Flach, Department of Computer Science, University of Bristol and The Alan Turing Institute, peter.flach@bristol.ac.uk
Pseudocode | No | The paper describes the implementation of Dirichlet calibration (e.g., 'easily implemented with neural nets since it is equivalent to log-transforming the uncalibrated probabilities, followed by one linear layer and softmax') but does not include structured pseudocode or algorithm blocks; a minimal sketch of this calibration map is given after the table.
Open Source Code | Yes | Full details and source code for training the models are in the Supplemental Material.
Open Datasets | Yes | Calibration methods were compared on 21 UCI datasets (abalone, balance-scale, car, cleveland, dermatology, glass, iris, landsat-satellite, libras-movement, mfeat-karhunen, mfeat-morphological, mfeat-zernike, optdigits, page-blocks, pendigits, segment, shuttle, vehicle, vowel, waveform-5000, yeast) ... We used 3 datasets (CIFAR-10, CIFAR-100 and SVHN)
Dataset Splits | Yes | In each of the 21 × 11 = 231 settings we performed nested cross-validation to evaluate 8 calibration methods... We used 3-fold internal cross-validation to train the calibration maps within the 5×5-fold external cross-validation (a sketch of this split structure follows the table). ... For the latter we set aside 5,000 test instances for fitting the calibration map. On other models we followed [9], setting aside 5,000 training instances (6,000 in SVHN) for calibration purposes and training the models as in the original papers.
Hardware Specification | No | The paper does not provide specific details regarding the hardware used for running the experiments (e.g., specific GPU/CPU models, memory, or cloud instance types).
Software Dependencies | No | The paper mentions software like Keras, JAX, and scikit-learn, but it does not specify version numbers for these software components. For example, it states: 'we used Keras [5] in the neural experiments.'
Experiment Setup | Yes | We used 3-fold internal cross-validation to train the calibration maps within the 5×5-fold external cross-validation. ... For calibration methods with hyperparameters we used the training folds of the classifier to choose the hyperparameter values with the lowest log-loss. ... For calibration methods with hyperparameters we used 5-fold cross-validation on the validation set to find optimal regularisation parameters. ... Fitting of Dirichlet calibration maps is performed by minimising log-loss, and by adding ODIR regularisation terms to the loss function as follows: ... where λ, µ are hyper-parameters tunable with internal cross-validation on the validation data. (A sketch of an ODIR-regularised objective follows the table.)
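
As a reading aid for the Pseudocode row above, here is a minimal sketch (not the authors' released code) of the calibration map described there: log-transform the uncalibrated probabilities, apply one linear layer, then a softmax. The parameter names W and b are hypothetical and stand for the fitted weights and intercepts.

```python
import numpy as np

def dirichlet_calibrate(probs, W, b):
    """Apply a Dirichlet calibration map to uncalibrated probabilities.

    probs : (n, k) array of uncalibrated class probabilities
    W, b  : (k, k) weight matrix and (k,) intercept of the linear layer
            (hypothetical names; in the paper these are fitted by minimising log-loss)
    """
    log_p = np.log(np.clip(probs, 1e-12, 1.0))   # log-transform, avoiding log(0)
    z = log_p @ W + b                            # one linear layer
    z -= z.max(axis=1, keepdims=True)            # stabilise the softmax numerically
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=1, keepdims=True)
```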
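The nested cross-validation quoted in the Dataset Splits row can be sketched with scikit-learn as below; the dataset and the fitting steps are placeholders, and only the fold structure (5×5-fold external, 3-fold internal) follows the quoted protocol.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, StratifiedKFold

X, y = load_iris(return_X_y=True)  # placeholder for any of the 21 UCI datasets

# 5x5-fold external CV evaluates each classifier/calibration-method pair;
# 3-fold internal CV on the external training portion trains the calibration maps.
external_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)
internal_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

for ext_train, ext_test in external_cv.split(X, y):
    X_tr, y_tr = X[ext_train], y[ext_train]
    for cal_fit, cal_val in internal_cv.split(X_tr, y_tr):
        # Fit the classifier on cal_fit, fit the calibration map on held-out
        # cal_val predictions, then evaluate the calibrated model on ext_test.
        pass
```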
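The ODIR-regularised objective mentioned in the Experiment Setup row (the equation itself is elided in the quote) can be sketched as log-loss plus separate penalties on the Off-Diagonal weights and the Intercept of the linear layer. The normalisation constants below are assumptions for illustration, not taken verbatim from the paper.

```python
import numpy as np

def odir_log_loss(W, b, log_p, y_onehot, lam, mu):
    """Log-loss of a Dirichlet calibration map plus ODIR penalty terms.

    lam (λ) and mu (µ) are the hyper-parameters tuned by internal
    cross-validation; the 1/(k(k-1)) and 1/k scalings are assumptions.
    """
    k = W.shape[0]
    z = log_p @ W + b                              # linear layer on log-probabilities
    z -= z.max(axis=1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)              # softmax gives calibrated probabilities
    nll = -np.mean(np.sum(y_onehot * np.log(np.clip(p, 1e-12, 1.0)), axis=1))
    off_diag = W - np.diag(np.diag(W))             # zero out the diagonal entries
    penalty = lam * np.sum(off_diag**2) / (k * (k - 1)) + mu * np.sum(b**2) / k
    return nll + penalty
```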