Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Hidden Heterogeneity: When to Choose Similarity-Based Calibration
Authors: Kiri L. Wagstaff, Thomas G. Dietterich
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted experiments with a variety of classifiers and data sets to compare local and global calibration methods and to determine the role that hidden heterogeneity plays. |
| Researcher Affiliation | Academia | Kiri L. Wagstaff, School of Electrical Engineering and Computer Science, Oregon State University; Thomas G. Dietterich, School of Electrical Engineering and Computer Science, Oregon State University |
| Pseudocode | Yes | Algorithm 1 Hidden Heterogeneity (HH). Input: test item x_t, calibration data C, predicted probabilities p̂, and radius r. Output: hidden heterogeneity in the neighborhood around x_t. 1: Construct the probability neighborhood around x_t: U_t = {x_i ∈ C \| D_H(p̂_t, p̂_i) < r} (using Eqn. 2). 2: Train g_t using the labeled data in U_t. 3: Collect model predictions for the neighborhood: f(U_t) = {p̂_i \| x_i ∈ U_t}. 4: Collect g_t predictions for the neighborhood: g_t(U_t) = {g_t(x_i) \| x_i ∈ U_t}. 5: Collect labels for the neighborhood: Y_{U_t} = {y_i \| x_i ∈ U_t}. 6: Calculate HH_{U_t} using f(U_t), g_t(U_t), and Y_{U_t} (Eqn. 3). (A similar listing is given for Algorithm 2, Similarity-Weighted Calibration (SWC).) |
| Open Source Code | Yes | Our implementations of SWC, SWC-HH, and other calibration methods, along with scripts to replicate the experiments, are available at https://github.com/wkiri/simcalib. |
| Open Datasets | Yes | letter (letter recognition): a 26-class data set ... available at https://archive.ics.uci.edu/ml/datasets/letter+recognition. mnist: the MNIST handwritten digit data set (Le Cun et al., 1998) ... original source is http://yann.lecun.com/exdb/mnist/. fashion-mnist: grayscale images of clothing and accessories ... available at https://github.com/zalandoresearch/fashion-mnist. CIFAR-10: 60,000 images ... (Krizhevsky, 2009) CIFAR-100: a disjoint set of 60,000 images ... (Krizhevsky, 2009) |
| Dataset Splits | Yes | For tabular data sets, we randomly sampled 10,000 items and for each trial randomly split them into 500 train, 500 test, and 9000 for a calibration pool. For the mnist10 and letter data sets, we used 1000 items each for training and test, due to their large number of classes (10 and 26, respectively). ... For image data sets, in each trial we generated a class-stratified random split of the standard test set into 5000 test items and reserved the remainder as the calibration set. |
| Hardware Specification | No | No specific hardware details (like GPU models, CPU types, or memory) used for running the experiments are provided in the paper. |
| Software Dependencies | No | The paper mentions using the 'scikit-learn Python library' but does not specify its version. It also refers to 'pre-trained neural networks' and a GitHub repository for their models, but lacks specific software dependency versions for libraries like PyTorch or TensorFlow. |
| Experiment Setup | Yes | We assessed calibration methods for six tabular data classifiers as implemented in the scikit-learn Python library (Pedregosa et al., 2011), including a decision tree with min_samples_leaf = 10 (DT), a random forest with 200 trees (RF), an ensemble of 200 gradient-boosted trees (GBT), a linear support vector machine (SVM), a Gaussian kernel (γ = 1/(d · var(X)), C = 1.0) support vector machine (RBFSVM), and a Naive Bayes classifier (NB). Any parameters not explicitly mentioned were set to their default values. ... We employ the calibration data to learn the relevant RFprox measure using a random forest with 100 trees, no depth limit, and considering a random set of d features for each split. ... We trained a bagged ensemble of 50 decision trees with no depth limit and no limit on the number of features searched for each split. ... We searched over 7 values of the α pruning complexity parameter, evenly spaced between 0.0 (no pruning) and 0.03. |
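The Hidden Heterogeneity pseudocode quoted in the table can be sketched in Python. This is a minimal illustration, not the authors' implementation: the names `hellinger` and `hidden_heterogeneity` are hypothetical, the local model g_t is approximated here by a small decision tree, and the HH statistic (the paper's Eqn. 3) is approximated as the reduction in log loss from the global to the local predictions, which is an assumption about its form.

```python
# Sketch of Algorithm 1 (Hidden Heterogeneity). Assumptions are noted
# in the lead-in: g_t is a decision tree, and HH is approximated as the
# log-loss reduction of the local model over the global predictions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import log_loss

def hellinger(p, q):
    # Hellinger distance between two discrete probability vectors.
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def hidden_heterogeneity(x_t_probs, cal_X, cal_probs, cal_y, radius):
    # Step 1: neighborhood U_t of calibration items whose predicted
    # probability vectors lie within `radius` of the test item's.
    dists = np.array([hellinger(x_t_probs, p) for p in cal_probs])
    mask = dists < radius
    U_X, U_probs, U_y = cal_X[mask], cal_probs[mask], cal_y[mask]
    if len(np.unique(U_y)) < 2:
        return 0.0  # a local model cannot be trained on one class
    # Step 2: train the local model g_t on the neighborhood.
    g_t = DecisionTreeClassifier(min_samples_leaf=5).fit(U_X, U_y)
    # Steps 3-6: compare global vs. local predictions on U_t.
    global_loss = log_loss(U_y, U_probs, labels=np.unique(cal_y))
    local_loss = log_loss(U_y, g_t.predict_proba(U_X),
                          labels=np.unique(U_y))
    return global_loss - local_loss
```

A large positive value indicates that a locally trained model beats the global model's probabilities in that neighborhood, i.e., hidden heterogeneity is present.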
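The tabular split protocol quoted in the table (10,000 sampled items split into 500 train, 500 test, and 9,000 calibration per trial) can be sketched as follows; the helper name `tabular_split` is hypothetical.

```python
# Sketch of the per-trial tabular split: 500 train / 500 test / 9,000
# calibration items drawn without replacement from the sampled pool.
import numpy as np

def tabular_split(X, y, rng, n_train=500, n_test=500, n_cal=9000):
    idx = rng.choice(len(X), size=n_train + n_test + n_cal, replace=False)
    tr, te, cal = np.split(idx, [n_train, n_train + n_test])
    return (X[tr], y[tr]), (X[te], y[te]), (X[cal], y[cal])
```

For the image data sets the paper instead stratifies by class when splitting the standard test set; a stratified variant would use per-class index pools.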
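The six tabular classifiers named in the setup can be instantiated in scikit-learn roughly as below. This is a hedged reconstruction from the quoted description, not the authors' script; note that `gamma="scale"` in scikit-learn computes 1/(d · var(X)), matching the stated RBF kernel parameter, and unspecified parameters are left at their defaults as the paper states.

```python
# Sketch of the six-classifier suite from the experiment setup.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB

classifiers = {
    "DT": DecisionTreeClassifier(min_samples_leaf=10),
    "RF": RandomForestClassifier(n_estimators=200),
    "GBT": GradientBoostingClassifier(n_estimators=200),
    "SVM": LinearSVC(C=1.0),                       # linear SVM
    "RBFSVM": SVC(kernel="rbf", gamma="scale", C=1.0),
    "NB": GaussianNB(),
}
```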