Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Improving Multi-Class Calibration through Normalization-Aware Isotonic Techniques

Authors: Alon Arad, Saharon Rosset

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Empirical evaluation on a variety of text and image classification datasets across different model architectures reveals that our approach consistently improves negative log-likelihood (NLL) and expected calibration error (ECE) metrics."
Researcher Affiliation | Academia | "1Department of Statistics, Tel Aviv University. Correspondence to: Alon Arad <alonarad1@mail.tau.ac.il>, Saharon Rosset <EMAIL>."
Pseudocode | Yes | Algorithm 1: MCMC Blockwise Normalized Aware Flattened Isotonic Optimization; Algorithm 2: 2-D Grid Sorted Cumulative Isotonic Maximal Upper Set Algorithm; Algorithm 3: 2-D Grid Sorted Cumulative Isotonic Maximal Upper Set Algorithm
Open Source Code | No | The paper mentions external libraries and frameworks such as the Focal Calibration Library, Hugging Face, and PyTorch Image Models for models and datasets, but makes no explicit statement about releasing the authors' own implementation or a link to their own code repository for the methodology described in the paper.
Open Datasets | Yes | CIFAR-10 and CIFAR-100 (Krizhevsky, 2009), Food-101 (Bossard et al., 2014), and ImageNet-1k (Russakovsky et al., 2015) for image classification, and R52, NG20, and Yelp Review (Zhang et al., 2015) for text classification.
Dataset Splits | Yes | Table 3, "Comparison Study Dataset Descriptions":
    Dataset       Type                  # Classes   # Models Used   Validation Set Size   Test Set Size
    CIFAR-10      Image Classification  10          4               5k                    10k
    CIFAR-100     Image Classification  100         13              5k / 2.5k             10k / 7.5k
    Food-101      Image Classification  101         8               6.3k                  18.6k
    ImageNet-1k   Image Classification  1000        8               12.5k                 37.5k
    R52           Text Classification   52          10              0.6k                  1.9k
    NG20          Text Classification   20          10              1.7k                  5.3k
    Yelp Review   Text Classification   5           22              12.5k                 37.5k
    All BERT-based models were trained with an evaluation set comprising 10% of the data for 5 epochs. The evaluation set size was reduced to 1% of the data.
Hardware Specification | No | "All timing results were obtained on a standard single-machine setup without GPU acceleration." This statement is too general: it does not name a CPU model, memory size, or any other specific hardware component.
Software Dependencies | No | The paper mentions the use of GloVe 6B with 300-dimensional embeddings and general frameworks such as Hugging Face and PyTorch Image Models, but does not provide version numbers for any of these software components or libraries.
Experiment Setup | Yes | All BERT-based models were trained with an evaluation set comprising 10% of the data for 5 epochs, using a learning rate of 1.5e-05 and a batch size of 16. For the simpler neural network models (FastText, Kim-CNN, and SWEM-Concat), the best-performing model was chosen from the following parameter grid: batch size (bs): [16, 32]; learning rate (lr): [1e-3, 5e-3, 1e-2]. Naive Bayes (alpha): [0.01, 0.05, 0.1, 1.0, 10.0]. SVM (C): [0.1, 1, 3, 10, 100]. In our experiments, we ran the MCMC algorithm with beta = 200, a maximum of 10^5 iterations, and an early stopping criterion triggered if the best likelihood did not improve for 10^4 iterations.
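The two metrics the paper reports, negative log-likelihood (NLL) and expected calibration error (ECE), are standard and can be sketched as follows. This is a minimal illustration using NumPy, not the authors' implementation; the choice of 15 equal-width confidence bins for ECE is an assumed convention, and `probs`/`labels` are hypothetical inputs (an (N, K) array of predicted class probabilities and an (N,) array of integer true labels).

```python
import numpy as np

def nll(probs, labels):
    """Average negative log-likelihood of the true class."""
    eps = 1e-12  # guard against log(0)
    true_class_probs = probs[np.arange(len(labels)), labels]
    return -np.mean(np.log(true_class_probs + eps))

def ece(probs, labels, n_bins=15):
    """Expected calibration error over equal-width confidence bins:
    a weighted average of |accuracy - mean confidence| per bin."""
    conf = probs.max(axis=1)            # model's confidence in its prediction
    pred = probs.argmax(axis=1)         # predicted class
    correct = (pred == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            total += mask.mean() * gap  # bin weight = fraction of samples
    return total
```

A calibration method such as the paper's normalization-aware isotonic approach would be judged by whether these two quantities decrease on a held-out test set after recalibrating the validation-set probabilities.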