Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Hidden Heterogeneity: When to Choose Similarity-Based Calibration
Authors: Kiri L. Wagstaff, Thomas G. Dietterich
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted experiments with a variety of classifiers and data sets to compare local and global calibration methods and to determine the role that hidden heterogeneity plays. |
| Researcher Affiliation | Academia | Kiri L. Wagstaff, School of Electrical Engineering and Computer Science, Oregon State University; Thomas G. Dietterich, School of Electrical Engineering and Computer Science, Oregon State University |
| Pseudocode | Yes | Algorithm 1 Hidden Heterogeneity (HH). Input: test item x_t, calibration data C, predicted probabilities p̂, and radius r. Output: hidden heterogeneity in the neighborhood around x_t. 1: Construct the probability neighborhood around x_t: U_t = {x_i ∈ C \| D_H(p̂_t, p̂_i) < r} (using Eqn. 2). 2: Train g_t using the labeled data in U_t. 3: Collect model predictions for the neighborhood: f(U_t) = {p̂_i \| x_i ∈ U_t}. 4: Collect g_t predictions for the neighborhood: g_t(U_t) = {g_t(x_i) \| x_i ∈ U_t}. 5: Collect labels for the neighborhood: Y_{U_t} = {y_i \| x_i ∈ U_t}. 6: Calculate HH_{U_t} using f(U_t), g_t(U_t), and Y_{U_t} (Eqn. 3). (A similar listing is given for Algorithm 2, Similarity-Weighted Calibration (SWC).) |
| Open Source Code | Yes | Our implementations of SWC, SWC-HH, and other calibration methods, along with scripts to replicate the experiments, are available at https://github.com/wkiri/simcalib. |
| Open Datasets | Yes | letter (letter recognition): a 26-class data set ... available at https://archive.ics.uci.edu/ml/datasets/letter+recognition. mnist: the MNIST handwritten digit data set (Le Cun et al., 1998) ... original source is http://yann.lecun.com/exdb/mnist/. fashion-mnist: grayscale images of clothing and accessories ... available at https://github.com/zalandoresearch/fashion-mnist. CIFAR-10: 60,000 images ... (Krizhevsky, 2009) CIFAR-100: a disjoint set of 60,000 images ... (Krizhevsky, 2009) |
| Dataset Splits | Yes | For tabular data sets, we randomly sampled 10,000 items and for each trial randomly split them into 500 train, 500 test, and 9000 for a calibration pool. For the mnist10 and letter data sets, we used 1000 items each for training and test, due to their large number of classes (10 and 26, respectively). ... For image data sets, in each trial we generated a class-stratified random split of the standard test set into 5000 test items and reserved the remainder as the calibration set. |
| Hardware Specification | No | No specific hardware details (like GPU models, CPU types, or memory) used for running the experiments are provided in the paper. |
| Software Dependencies | No | The paper mentions using the 'scikit-learn Python library' but does not specify its version. It also refers to 'pre-trained neural networks' and a GitHub repository for their models, but lacks specific software dependency versions for libraries like PyTorch or TensorFlow. |
| Experiment Setup | Yes | We assessed calibration methods for six tabular data classifiers as implemented in the scikit-learn Python library (Pedregosa et al., 2011), including a decision tree with min_samples_leaf = 10 (DT), a random forest with 200 trees (RF), an ensemble of 200 gradient-boosted trees (GBT), a linear support vector machine (SVM), a Gaussian kernel (γ = 1/(d · var(X)), C = 1.0) support vector machine (RBFSVM), and a Naive Bayes classifier (NB). Any parameters not explicitly mentioned were set to their default values. ... We employ the calibration data to learn the relevant RFprox measure using a random forest with 100 trees, no depth limit, and considering a random set of d features for each split. ... We trained a bagged ensemble of 50 decision trees with no depth limit and no limit on the number of features searched for each split. ... We searched over 7 values of the α pruning complexity parameter, evenly spaced between 0.0 (no pruning) and 0.03. |
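The Hidden Heterogeneity pseudocode quoted in the table can be sketched in Python. This is a minimal illustration, not the authors' implementation: the names `hellinger` and `hidden_heterogeneity` are hypothetical, the local model g_t is approximated here by a small decision tree, and the HH statistic (the paper's Eqn. 3) is approximated as the reduction in log loss from the global to the local predictions, which is an assumption about its form.

```python
# Sketch of Algorithm 1 (Hidden Heterogeneity). Assumptions are noted
# in the lead-in: g_t is a decision tree, and HH is approximated as the
# log-loss reduction of the local model over the global predictions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import log_loss

def hellinger(p, q):
    # Hellinger distance between two discrete probability vectors.
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def hidden_heterogeneity(x_t_probs, cal_X, cal_probs, cal_y, radius):
    # Step 1: neighborhood U_t of calibration items whose predicted
    # probability vectors lie within `radius` of the test item's.
    dists = np.array([hellinger(x_t_probs, p) for p in cal_probs])
    mask = dists < radius
    U_X, U_probs, U_y = cal_X[mask], cal_probs[mask], cal_y[mask]
    if len(np.unique(U_y)) < 2:
        return 0.0  # a local model cannot be trained on one class
    # Step 2: train the local model g_t on the neighborhood.
    g_t = DecisionTreeClassifier(min_samples_leaf=5).fit(U_X, U_y)
    # Steps 3-6: compare global vs. local predictions on U_t.
    global_loss = log_loss(U_y, U_probs, labels=np.unique(cal_y))
    local_loss = log_loss(U_y, g_t.predict_proba(U_X),
                          labels=np.unique(U_y))
    return global_loss - local_loss
```

A large positive value indicates that a locally trained model beats the global model's probabilities in that neighborhood, i.e., hidden heterogeneity is present.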
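The tabular split protocol quoted in the table (10,000 sampled items split into 500 train, 500 test, and 9,000 calibration per trial) can be sketched as follows; the helper name `tabular_split` is hypothetical.

```python
# Sketch of the per-trial tabular split: 500 train / 500 test / 9,000
# calibration items drawn without replacement from the sampled pool.
import numpy as np

def tabular_split(X, y, rng, n_train=500, n_test=500, n_cal=9000):
    idx = rng.choice(len(X), size=n_train + n_test + n_cal, replace=False)
    tr, te, cal = np.split(idx, [n_train, n_train + n_test])
    return (X[tr], y[tr]), (X[te], y[te]), (X[cal], y[cal])
```

For the image data sets the paper instead stratifies by class when splitting the standard test set; a stratified variant would use per-class index pools.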
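The six tabular classifiers named in the setup can be instantiated in scikit-learn roughly as below. This is a hedged reconstruction from the quoted description, not the authors' script; note that `gamma="scale"` in scikit-learn computes 1/(d · var(X)), matching the stated RBF kernel parameter, and unspecified parameters are left at their defaults as the paper states.

```python
# Sketch of the six-classifier suite from the experiment setup.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB

classifiers = {
    "DT": DecisionTreeClassifier(min_samples_leaf=10),
    "RF": RandomForestClassifier(n_estimators=200),
    "GBT": GradientBoostingClassifier(n_estimators=200),
    "SVM": LinearSVC(C=1.0),                       # linear SVM
    "RBFSVM": SVC(kernel="rbf", gamma="scale", C=1.0),
    "NB": GaussianNB(),
}
```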