Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Improving Multi-Class Calibration through Normalization-Aware Isotonic Techniques

Authors: Alon Arad, Saharon Rosset

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Empirical evaluation on a variety of text and image classification datasets across different model architectures reveals that our approach consistently improves negative log-likelihood (NLL) and expected calibration error (ECE) metrics."
Researcher Affiliation | Academia | "1Department of Statistics, Tel Aviv University. Correspondence to: Alon Arad <alonarad1@mail.tau.ac.il>, Saharon Rosset <EMAIL>."
Pseudocode | Yes | Algorithm 1: MCMC Blockwise Normalized Aware Flattened Isotonic Optimization; Algorithm 2: 2-D Grid Sorted Cumulative Isotonic Maximal Upper Set Algorithm; Algorithm 3: 2-D Grid Sorted Cumulative Isotonic Maximal Upper Set Algorithm
Open Source Code | No | The paper mentions external libraries and frameworks such as the Focal Calibration Library, Hugging Face, and PyTorch Image Models for models and datasets, but makes no explicit statement about releasing the authors' own implementation or a link to their own code repository for the methodology described in the paper.
Open Datasets | Yes | CIFAR-10 and CIFAR-100 (Krizhevsky, 2009), Food-101 (Bossard et al., 2014), and ImageNet-1k (Russakovsky et al., 2015) for image classification, and R52, NG20, and Yelp Review (Zhang et al., 2015) for text classification.
Dataset Splits | Yes | Table 3, "Comparison Study Dataset Descriptions":
    Dataset       Type                  # Classes   # Models Used   Validation Set Size   Test Set Size
    CIFAR-10      Image Classification  10          4               5k                    10k
    CIFAR-100     Image Classification  100         13              5k / 2.5k             10k / 7.5k
    Food-101      Image Classification  101         8               6.3k                  18.6k
    ImageNet-1k   Image Classification  1000        8               12.5k                 37.5k
    R52           Text Classification   52          10              0.6k                  1.9k
    NG20          Text Classification   20          10              1.7k                  5.3k
    Yelp Review   Text Classification   5           22              12.5k                 37.5k
    All BERT-based models were trained with an evaluation set comprising 10% of the data for 5 epochs. The evaluation set size was reduced to 1% of the data.
Hardware Specification | No | "All timing results were obtained on a standard single-machine setup without GPU acceleration." This statement is too general: it does not name a CPU model, memory size, or any other specific hardware component.
Software Dependencies | No | The paper mentions the use of GloVe 6B with 300-dimensional embeddings and general frameworks such as Hugging Face and PyTorch Image Models, but does not provide version numbers for any of these software components or libraries.
Experiment Setup | Yes | All BERT-based models were trained with an evaluation set comprising 10% of the data for 5 epochs, using a learning rate of 1.5e-05 and a batch size of 16. For the simpler neural network models (FastText, Kim-CNN, and SWEM-Concat), the best-performing model was chosen from the following parameter grid: batch size (bs): [16, 32]; learning rate (lr): [1e-3, 5e-3, 1e-2]. Naive Bayes (alpha): [0.01, 0.05, 0.1, 1.0, 10.0]. SVM (C): [0.1, 1, 3, 10, 100]. In our experiments, we ran the MCMC algorithm with beta = 200, a maximum of 10^5 iterations, and an early stopping criterion triggered if the best likelihood did not improve for 10^4 iterations.
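The two metrics the paper reports, negative log-likelihood (NLL) and expected calibration error (ECE), are standard and can be sketched as follows. This is a minimal illustration using NumPy, not the authors' implementation; the choice of 15 equal-width confidence bins for ECE is an assumed convention, and `probs`/`labels` are hypothetical inputs (an (N, K) array of predicted class probabilities and an (N,) array of integer true labels).

```python
import numpy as np

def nll(probs, labels):
    """Average negative log-likelihood of the true class."""
    eps = 1e-12  # guard against log(0)
    true_class_probs = probs[np.arange(len(labels)), labels]
    return -np.mean(np.log(true_class_probs + eps))

def ece(probs, labels, n_bins=15):
    """Expected calibration error over equal-width confidence bins:
    a weighted average of |accuracy - mean confidence| per bin."""
    conf = probs.max(axis=1)            # model's confidence in its prediction
    pred = probs.argmax(axis=1)         # predicted class
    correct = (pred == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            total += mask.mean() * gap  # bin weight = fraction of samples
    return total
```

A calibration method such as the paper's normalization-aware isotonic approach would be judged by whether these two quantities decrease on a held-out test set after recalibrating the validation-set probabilities.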