Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration
Authors: Meelis Kull, Miquel Perello Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, Peter Flach
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate improved probabilistic predictions according to multiple measures (confidence-ECE, classwise-ECE, log-loss, Brier score) across a wide range of datasets and classifiers. |
| Researcher Affiliation | Academia | Meelis Kull, Department of Computer Science, University of Tartu (meelis.kull@ut.ee); Miquel Perello-Nieto, Department of Computer Science, University of Bristol (miquel.perellonieto@bris.ac.uk); Markus Kängsepp, Department of Computer Science, University of Tartu (markus.kangsepp@ut.ee); Telmo Silva Filho, Department of Statistics, Universidade Federal da Paraíba (telmo@de.ufpb.br); Hao Song, Department of Computer Science, University of Bristol (hao.song@bristol.ac.uk); Peter Flach, Department of Computer Science, University of Bristol and The Alan Turing Institute (peter.flach@bristol.ac.uk) |
| Pseudocode | No | The paper describes implementation details of Dirichlet calibration (e.g., it is 'easily implemented with neural nets since it is equivalent to log-transforming the uncalibrated probabilities, followed by one linear layer and softmax') but does not include structured pseudocode or algorithm blocks. A minimal sketch of such a calibration map is given after the table. |
| Open Source Code | Yes | Full details and source code for training the models are in the Supplemental Material. |
| Open Datasets | Yes | Calibration methods were compared on 21 UCI datasets (abalone, balancescale, car, cleveland, dermatology, glass, iris, landsat-satellite, libras-movement, mfeat-karhunen, mfeat-morphological, mfeat-zernike, optdigits, page-blocks, pendigits, segment, shuttle, vehicle, vowel, waveform-5000, yeast) ... We used 3 datasets (CIFAR-10, CIFAR-100 and SVHN) |
| Dataset Splits | Yes | In each of the 21 × 11 = 231 settings we performed nested cross-validation to evaluate 8 calibration methods... We used 3-fold internal cross-validation to train the calibration maps within the 5 × 5-fold external cross-validation. ... For the latter we set aside 5,000 test instances for fitting the calibration map. On other models we followed [9], setting aside 5,000 training instances (6,000 in SVHN) for calibration purposes and training the models as in the original papers. (A sketch of the nested protocol is given after the table.) |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware used for running the experiments (e.g., specific GPU/CPU models, memory, or cloud instance types). |
| Software Dependencies | No | The paper names software such as Keras, JAX, and scikit-learn, but does not specify version numbers for any of them. For example, it states: 'we used Keras [5] in the neural experiments.' |
| Experiment Setup | Yes | We used 3-fold internal cross-validation to train the calibration maps within the 5 × 5-fold external cross-validation. ... For calibration methods with hyperparameters we used the training folds of the classifier to choose the hyperparameter values with the lowest log-loss. ... For calibration methods with hyperparameters we used 5-fold cross-validation on the validation set to find optimal regularisation parameters. ... Fitting of Dirichlet calibration maps is performed by minimising log-loss, and by adding ODIR regularisation terms to the loss function as follows: ... where λ, µ are hyperparameters tunable with internal cross-validation on the validation data. (A fitting sketch with ODIR regularisation is given after the table.) |
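
The Pseudocode row quotes the paper's observation that a Dirichlet calibration map is equivalent to log-transforming the uncalibrated probabilities, followed by one linear layer and a softmax. A minimal NumPy sketch of applying such a map is below; the parameter names `W` and `b` and the clipping constant are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def apply_dirichlet_map(probs, W, b):
    """Apply a fitted Dirichlet calibration map to uncalibrated probabilities.

    probs : (n, k) array of uncalibrated class probabilities
    W, b  : (k, k) weights and (k,) intercepts of the single linear layer
    """
    logp = np.log(np.clip(probs, 1e-12, 1.0))  # log-transform the probabilities
    z = logp @ W.T + b                         # one linear layer
    z -= z.max(axis=1, keepdims=True)          # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)
```

With `W` the identity matrix and `b` all zeros this reduces to the identity map, which is a natural starting point when fitting the map's parameters.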
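
The Dataset Splits row describes 5 × 5-fold external cross-validation with 3-fold internal cross-validation for the calibration maps. The protocol could be expressed with scikit-learn roughly as follows; this is a sketch of the split structure only, using synthetic stand-in data, and stratification of the folds is an assumption here, not stated in the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, KFold

# Synthetic stand-in data; the paper uses 21 UCI datasets.
X, y = make_classification(n_samples=500, n_classes=3, n_informative=6,
                           random_state=0)

# 5 times 5-fold external cross-validation (stratification assumed).
external = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)
for train_idx, test_idx in external.split(X, y):
    # 3-fold internal cross-validation on the training part, used to fit the
    # calibration maps and tune their hyperparameters by log-loss.
    internal = KFold(n_splits=3, shuffle=True, random_state=0)
    for fit_pos, val_pos in internal.split(train_idx):
        cal_fit, cal_val = train_idx[fit_pos], train_idx[val_pos]
        # ... train classifier on cal_fit, fit calibration map, score on cal_val
```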
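
The Experiment Setup row elides the ODIR-regularised loss itself. As the paper describes it, ODIR (Off-Diagonal and Intercept Regularisation) penalises the off-diagonal weights and the intercepts of the calibration map separately. The sketch below fits such a map by minimising log-loss plus these two penalties with SciPy; the normalisation constants and the default λ, µ values are assumptions, since the exact formula is not quoted above.

```python
import numpy as np
from scipy.optimize import minimize

def fit_dirichlet_odir(probs, labels, lam=1e-2, mu=1e-2):
    """Fit a Dirichlet calibration map by minimising log-loss plus ODIR terms.

    probs  : (n, k) uncalibrated probabilities on the calibration set
    labels : (n,) integer class labels in [0, k)
    lam/mu : placeholders for the paper's lambda/mu (tuned there by internal CV)
    """
    n, k = probs.shape
    logp = np.log(np.clip(probs, 1e-12, 1.0))
    off_diag = ~np.eye(k, dtype=bool)

    def loss(theta):
        W = theta[:k * k].reshape(k, k)
        b = theta[k * k:]
        z = logp @ W.T + b
        z -= z.max(axis=1, keepdims=True)
        log_softmax = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        nll = -log_softmax[np.arange(n), labels].mean()         # log-loss
        odir = (lam * (W[off_diag] ** 2).sum() / (k * (k - 1))  # off-diagonal weights
                + mu * (b ** 2).sum() / k)                      # intercepts
        return nll + odir

    theta0 = np.concatenate([np.eye(k).ravel(), np.zeros(k)])   # identity map start
    res = minimize(loss, theta0, method="L-BFGS-B")
    return res.x[:k * k].reshape(k, k), res.x[k * k:]
```

The returned `W`, `b` can be passed directly to the `apply_dirichlet_map` sketch above; in the paper's protocol, λ and µ would then be selected by internal cross-validation rather than fixed.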