Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration

Authors: Meelis Kull, Miquel Perello Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, Peter Flach

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate improved probabilistic predictions according to multiple measures (confidence-ECE, classwise-ECE, log-loss, Brier score) across a wide range of datasets and classifiers.
Researcher Affiliation | Academia | Meelis Kull, Department of Computer Science, University of Tartu; Miquel Perello-Nieto, Department of Computer Science, University of Bristol; Markus Kängsepp, Department of Computer Science, University of Tartu; Telmo Silva Filho, Department of Statistics, Universidade Federal da Paraíba; Hao Song, Department of Computer Science, University of Bristol; Peter Flach, Department of Computer Science, University of Bristol and The Alan Turing Institute
Pseudocode | No | The paper describes implementation details of Dirichlet calibration (e.g., 'easily implemented with neural nets since it is equivalent to log-transforming the uncalibrated probabilities, followed by one linear layer and softmax') but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Full details and source code for training the models are in the Supplemental Material.
Open Datasets | Yes | Calibration methods were compared on 21 UCI datasets (abalone, balance-scale, car, cleveland, dermatology, glass, iris, landsat-satellite, libras-movement, mfeat-karhunen, mfeat-morphological, mfeat-zernike, optdigits, page-blocks, pendigits, segment, shuttle, vehicle, vowel, waveform-5000, yeast) ... We used 3 datasets (CIFAR-10, CIFAR-100 and SVHN)
Dataset Splits | Yes | In each of the 21 × 11 = 231 settings we performed nested cross-validation to evaluate 8 calibration methods... We used 3-fold internal cross-validation to train the calibration maps within the 5×5-fold external cross-validation. ... For the latter we set aside 5,000 test instances for fitting the calibration map. On other models we followed [9], setting aside 5,000 training instances (6,000 in SVHN) for calibration purposes and training the models as in the original papers.
Hardware Specification | No | The paper does not provide specific details regarding the hardware used for running the experiments (e.g., specific GPU/CPU models, memory, or cloud instance types).
Software Dependencies | No | The paper mentions software such as Keras, JAX, and scikit-learn, but it does not specify version numbers for these components. For example, it states: 'we used Keras [5] in the neural experiments.'
Experiment Setup | Yes | We used 3-fold internal cross-validation to train the calibration maps within the 5×5-fold external cross-validation. ... For calibration methods with hyperparameters we used the training folds of the classifier to choose the hyperparameter values with the lowest log-loss. ... For calibration methods with hyperparameters we used 5-fold cross-validation on the validation set to find optimal regularisation parameters. ... Fitting of Dirichlet calibration maps is performed by minimising log-loss, and by adding ODIR regularisation terms to the loss function as follows: ... where λ, µ are hyper-parameters tunable with internal cross-validation on the validation data.
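The calibration map quoted in the Pseudocode row (log-transform, one linear layer, softmax) can be sketched in a few lines of NumPy. This is an illustrative forward pass only; in the paper the weights W and intercept b are fitted by minimising log-loss, which is not shown here.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dirichlet_calibrate(probs, W, b, eps=1e-12):
    """Dirichlet calibration map: log-transform the uncalibrated
    probabilities, then apply one linear layer followed by softmax."""
    logp = np.log(np.clip(probs, eps, 1.0))
    return softmax(logp @ W.T + b)

# With W = I and b = 0 the map is the identity: softmax(log p) = p.
p = np.array([[0.7, 0.2, 0.1]])
calibrated = dirichlet_calibrate(p, np.eye(3), np.zeros(3))
```

Because softmax inverts the log-transform when W is the identity and b is zero, the identity parameters leave the probability vectors unchanged, which is a convenient sanity check on any implementation.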
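The nested protocol from the Dataset Splits row (3-fold internal cross-validation for the calibration map inside 5×5-fold external cross-validation) can be sketched with scikit-learn. Dirichlet calibration itself is not in scikit-learn, so sigmoid (Platt) calibration stands in here purely to illustrate the splitting structure; the dataset and classifier are likewise placeholders.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# 5x5-fold external cross-validation: 5 folds repeated 5 times.
outer = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)

scores = []
for train_idx, test_idx in outer.split(X, y):
    # cv=3: a 3-fold internal cross-validation fits the calibration
    # map using only the external training folds.
    model = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=3)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
```

The loop yields 25 external train/test evaluations, matching the 5×5-fold design quoted from the paper.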
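The ODIR (Off-Diagonal and Intercept Regularisation) objective quoted in the Experiment Setup row is elided in the response, so the sketch below uses a plain sum-of-squares penalty: λ on the off-diagonal entries of W and µ on the intercept b, added to the log-loss. The exact normalisation of these terms may differ from the paper's equation.

```python
import numpy as np

def odir_loss(probs, y, W, b, lam, mu, eps=1e-12):
    """Log-loss of a Dirichlet calibration map plus ODIR penalties:
    lam penalises off-diagonal entries of W, mu the intercept b."""
    probs = np.asarray(probs, dtype=float)
    y = np.asarray(y)
    logp = np.log(np.clip(probs, eps, 1.0))
    z = logp @ W.T + b                    # linear layer on log-probs
    z = z - z.max(axis=1, keepdims=True)  # stable log-softmax
    logq = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -logq[np.arange(len(y)), y].mean()
    k = W.shape[0]
    off_diag_penalty = np.sum(W[~np.eye(k, dtype=bool)] ** 2)
    return nll + lam * off_diag_penalty + mu * np.sum(b ** 2)
```

With W = I and b = 0 both penalty terms vanish and the loss reduces to the plain log-loss of the uncalibrated probabilities, so regularisation only bites when the map moves away from the identity.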