Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration

Authors: Meelis Kull, Miquel Perello Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, Peter Flach

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate improved probabilistic predictions according to multiple measures (confidence-ECE, classwise-ECE, log-loss, Brier score) across a wide range of datasets and classifiers.
Researcher Affiliation | Academia | Meelis Kull, Department of Computer Science, University of Tartu; Miquel Perello-Nieto, Department of Computer Science, University of Bristol; Markus Kängsepp, Department of Computer Science, University of Tartu; Telmo Silva Filho, Department of Statistics, Universidade Federal da Paraíba; Hao Song, Department of Computer Science, University of Bristol; Peter Flach, Department of Computer Science, University of Bristol and The Alan Turing Institute
Pseudocode | No | The paper describes implementation details of Dirichlet calibration (e.g., 'easily implemented with neural nets since it is equivalent to log-transforming the uncalibrated probabilities, followed by one linear layer and softmax') but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Full details and source code for training the models are in the Supplemental Material.
Open Datasets | Yes | Calibration methods were compared on 21 UCI datasets (abalone, balance-scale, car, cleveland, dermatology, glass, iris, landsat-satellite, libras-movement, mfeat-karhunen, mfeat-morphological, mfeat-zernike, optdigits, page-blocks, pendigits, segment, shuttle, vehicle, vowel, waveform-5000, yeast) ... We used 3 datasets (CIFAR-10, CIFAR-100 and SVHN)
Dataset Splits | Yes | In each of the 21 × 11 = 231 settings we performed nested cross-validation to evaluate 8 calibration methods... We used 3-fold internal cross-validation to train the calibration maps within the 5×5-fold external cross-validation. ... For the latter we set aside 5,000 test instances for fitting the calibration map. On other models we followed [9], setting aside 5,000 training instances (6,000 in SVHN) for calibration purposes and training the models as in the original papers.
Hardware Specification | No | The paper does not provide specific details regarding the hardware used for running the experiments (e.g., specific GPU/CPU models, memory, or cloud instance types).
Software Dependencies | No | The paper mentions software such as Keras, JAX, and scikit-learn, but it does not specify version numbers for these components. For example, it states: 'we used Keras [5] in the neural experiments.'
Experiment Setup | Yes | We used 3-fold internal cross-validation to train the calibration maps within the 5×5-fold external cross-validation. ... For calibration methods with hyperparameters we used the training folds of the classifier to choose the hyperparameter values with the lowest log-loss. ... For calibration methods with hyperparameters we used 5-fold cross-validation on the validation set to find optimal regularisation parameters. ... Fitting of Dirichlet calibration maps is performed by minimising log-loss, and by adding ODIR regularisation terms to the loss function as follows: ... where λ, µ are hyper-parameters tunable with internal cross-validation on the validation data.
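The calibration map quoted in the Pseudocode row (log-transform, one linear layer, softmax) can be sketched in a few lines of NumPy. This is an illustrative forward pass only; in the paper the weights W and intercept b are fitted by minimising log-loss, which is not shown here.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dirichlet_calibrate(probs, W, b, eps=1e-12):
    """Dirichlet calibration map: log-transform the uncalibrated
    probabilities, then apply one linear layer followed by softmax."""
    logp = np.log(np.clip(probs, eps, 1.0))
    return softmax(logp @ W.T + b)

# With W = I and b = 0 the map is the identity: softmax(log p) = p.
p = np.array([[0.7, 0.2, 0.1]])
calibrated = dirichlet_calibrate(p, np.eye(3), np.zeros(3))
```

Because softmax inverts the log-transform when W is the identity and b is zero, the identity parameters leave the probability vectors unchanged, which is a convenient sanity check on any implementation.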
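The nested protocol from the Dataset Splits row (3-fold internal cross-validation for the calibration map inside 5×5-fold external cross-validation) can be sketched with scikit-learn. Dirichlet calibration itself is not in scikit-learn, so sigmoid (Platt) calibration stands in here purely to illustrate the splitting structure; the dataset and classifier are likewise placeholders.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# 5x5-fold external cross-validation: 5 folds repeated 5 times.
outer = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)

scores = []
for train_idx, test_idx in outer.split(X, y):
    # cv=3: a 3-fold internal cross-validation fits the calibration
    # map using only the external training folds.
    model = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=3)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
```

The loop yields 25 external train/test evaluations, matching the 5×5-fold design quoted from the paper.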
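The ODIR (Off-Diagonal and Intercept Regularisation) objective quoted in the Experiment Setup row is elided in the response, so the sketch below uses a plain sum-of-squares penalty: λ on the off-diagonal entries of W and µ on the intercept b, added to the log-loss. The exact normalisation of these terms may differ from the paper's equation.

```python
import numpy as np

def odir_loss(probs, y, W, b, lam, mu, eps=1e-12):
    """Log-loss of a Dirichlet calibration map plus ODIR penalties:
    lam penalises off-diagonal entries of W, mu the intercept b."""
    probs = np.asarray(probs, dtype=float)
    y = np.asarray(y)
    logp = np.log(np.clip(probs, eps, 1.0))
    z = logp @ W.T + b                    # linear layer on log-probs
    z = z - z.max(axis=1, keepdims=True)  # stable log-softmax
    logq = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -logq[np.arange(len(y)), y].mean()
    k = W.shape[0]
    off_diag_penalty = np.sum(W[~np.eye(k, dtype=bool)] ** 2)
    return nll + lam * off_diag_penalty + mu * np.sum(b ** 2)
```

With W = I and b = 0 both penalty terms vanish and the loss reduces to the plain log-loss of the uncalibrated probabilities, so regularisation only bites when the map moves away from the identity.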