Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Calibrated Learning to Defer with One-vs-All Classifiers

Authors: Rajeev Verma, Eric Nalisnick

ICML 2022 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experiments verify that not only is our system calibrated, but this benefit comes at no cost to accuracy. Our model's accuracy is always comparable (and often superior) to Mozannar & Sontag's (2020) model's in tasks ranging from hate speech detection to galaxy classification to diagnosis of skin lesions.
Researcher Affiliation	Academia	Rajeev Verma¹, Eric Nalisnick¹. ¹Informatics Institute, University of Amsterdam, Amsterdam, Netherlands. Correspondence to: Rajeev Verma <EMAIL>, Eric Nalisnick <EMAIL>.
Pseudocode	No	The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Our software implementations are publicly available: https://github.com/rajevv/OvA-L2D
Open Datasets	Yes	We use the standard train-test splits of CIFAR-10 (Krizhevsky, 2009). We also use HAM10000 (Tschandl et al., 2018), Galaxy-Zoo (Bamford et al., 2009), and Hate Speech (Davidson et al., 2017) datasets.
Dataset Splits	Yes	We further partition the training split by 90%-10% to form training and validation sets, respectively. We partition the data into 60% training, 20% validation, and 20% test splits.
Hardware Specification	No	The paper states training was done using
Software Dependencies	No	The paper mentions using SGD and Adam optimizers, Wide Residual Networks, MLPMixer, and ResNet34 models, but does not specify software dependencies with version numbers (e.g., PyTorch 1.9, Python 3.8).
Experiment Setup	Yes	We use SGD with a momentum of 0.9, weight decay 5e-4, and initial learning rate of 0.1. We further use cosine annealing learning rate schedule. We train this model with Adam optimization algorithm with a learning rate of 0.001, weight decay of 5e-4. We further use cosine annealing learning rate schedule with a warm-up period of 5 epochs.
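
The learning rate schedules and split ratios quoted in the table can be illustrated with a small, dependency-free sketch. This is not the authors' code. The function names `cosine_lr` and `split_sizes`, the linear shape of the warm-up, the minimum learning rate of 0, and the epoch counts and dataset size in the usage note are all assumptions made for illustration; the excerpts above give only the base learning rates, the 5-epoch warm-up length, and the split percentages.

```python
import math


def cosine_lr(epoch, total_epochs, base_lr, warmup_epochs=0, min_lr=0.0):
    """Learning rate at a 0-indexed epoch.

    Assumed behaviour (not confirmed by the paper excerpts): a linear
    warm-up over `warmup_epochs`, followed by cosine annealing from
    `base_lr` down to `min_lr` at `total_epochs`.
    """
    if warmup_epochs > 0 and epoch < warmup_epochs:
        # Linear ramp that reaches base_lr on the last warm-up epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    # Fraction of the annealing phase completed, in [0, 1].
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def split_sizes(n, fractions):
    """Integer split sizes that sum to n; the last split takes the remainder."""
    sizes = [round(n * f) for f in fractions[:-1]]
    sizes.append(n - sum(sizes))
    return sizes
```

For example, with a hypothetical 200-epoch run, `cosine_lr(0, 200, 0.1)` starts at the quoted SGD learning rate of 0.1 and decays to 0 by epoch 200. With a hypothetical 100-epoch run, `cosine_lr(epoch, 100, 0.001, warmup_epochs=5)` ramps up to the quoted Adam learning rate of 0.001 over the first 5 epochs before annealing. For a hypothetical dataset of 1000 examples, `split_sizes(1000, [0.6, 0.2, 0.2])` gives the 60%/20%/20% partition.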