Confidence Calibration of Classifiers with Many Classes
Authors: Adrien Le Coz, Stéphane Herbin, Faouzi Adjed
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on numerous neural networks used for image or text classification and show that it significantly enhances existing calibration methods. Our code can be accessed at the following link: https://github.com/allglc/tva-calibration. |
| Researcher Affiliation | Collaboration | Adrien Le Coz (1,2,3), Stéphane Herbin (2,3), Faouzi Adjed (1) — 1: IRT SystemX, 2: ONERA DTIS, 3: Paris-Saclay University |
| Pseudocode | Yes | Algorithm 1 Top-versus-All approach to confidence calibration |
| Open Source Code | Yes | Our code can be accessed at the following link: https://github.com/allglc/tva-calibration. |
| Open Datasets | Yes | For image classification, we used the datasets CIFAR-10 (C10) and CIFAR-100 (C100) [27] with 10 and 100 classes respectively, ImageNet (IN) [7] with 1000 classes, and ImageNet-21K (IN21K) [54] with 10450 classes. For text classification, we used Amazon Fine Foods (AFF) [43] and DynaSent (DF) [51] for sentiment analysis with 3 classes, MNLI [66] for natural language inference with 3 classes, and Yahoo Answers (YA) [73] for topic classification on 10 classes. |
| Dataset Splits | Yes | Experiment results are averaged over five random seeds that randomly split the concatenation of the original validation and test sets into calibration and test sets. |
| Hardware Specification | Yes | Computing time (in seconds) of the calibration on ImageNet, using one NVIDIA V100 GPU. |
| Software Dependencies | Yes | We used PyTorch 2.0.0 [3] (BSD-style license). |
| Experiment Setup | Yes | For HB (histogram binning), we tested equal-size and equal-mass bins, and chose the best variant for each case. All hyperparameters were kept at their default values (10 bins for HB). |
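The Top-versus-All idea referenced in the pseudocode row (Algorithm 1) reduces multiclass calibration to a binary problem: keep only the top-class confidence and recalibrate it against whether the top prediction was correct. The sketch below is a hypothetical, minimal NumPy illustration of that reduction, paired with equal-size histogram binning over 10 bins as mentioned in the experiment-setup row; it is not the authors' implementation (see their repository at https://github.com/allglc/tva-calibration for the real code), and the function name and fallback for empty bins are assumptions.

```python
import numpy as np

def top_versus_all_binning(probs_cal, labels_cal, probs_test, n_bins=10):
    """Hypothetical sketch: Top-versus-All reduction + histogram binning.

    probs_cal, probs_test: (N, K) softmax probabilities.
    labels_cal: (N,) integer ground-truth labels for the calibration set.
    Returns (test predictions, recalibrated top-class confidences).
    """
    # Binary reduction: was the top prediction correct?
    preds_cal = probs_cal.argmax(axis=1)
    conf_cal = probs_cal.max(axis=1)                  # top-class confidence
    correct = (preds_cal == labels_cal).astype(float)  # binary target

    # Equal-size bins over [0, 1]; each bin maps confidence -> accuracy.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_idx = np.clip(np.digitize(conf_cal, edges) - 1, 0, n_bins - 1)
    bin_acc = np.array([
        correct[bin_idx == b].mean() if np.any(bin_idx == b)
        else (edges[b] + edges[b + 1]) / 2  # assumed fallback: bin midpoint
        for b in range(n_bins)
    ])

    # Recalibrate test confidences by bin lookup; predictions are unchanged.
    conf_test = probs_test.max(axis=1)
    test_idx = np.clip(np.digitize(conf_test, edges) - 1, 0, n_bins - 1)
    return probs_test.argmax(axis=1), bin_acc[test_idx]
```

Because only the scalar top-class confidence is calibrated, the cost is independent of the number of classes, which is the point of the approach for many-class settings such as ImageNet-21K.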