Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Calibrated Learning to Defer with One-vs-All Classifiers
Authors: Rajeev Verma, Eric Nalisnick
ICML 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments verify that not only is our system calibrated, but this benefit comes at no cost to accuracy. Our model's accuracy is always comparable (and often superior) to Mozannar & Sontag's (2020) model's in tasks ranging from hate speech detection to galaxy classification to diagnosis of skin lesions. |
| Researcher Affiliation | Academia | Rajeev Verma¹, Eric Nalisnick¹. ¹Informatics Institute, University of Amsterdam, Amsterdam, Netherlands. Correspondence to: Rajeev Verma <EMAIL>, Eric Nalisnick <EMAIL>. |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our software implementations are publicly available.¹ ¹https://github.com/rajevv/OvA-L2D |
| Open Datasets | Yes | We use the standard train-test splits of CIFAR-10 (Krizhevsky, 2009). We also use HAM10000 (Tschandl et al., 2018), Galaxy-Zoo (Bamford et al., 2009), and Hate Speech (Davidson et al., 2017) datasets. |
| Dataset Splits | Yes | We further partition the training split by 90%-10% to form training and validation sets, respectively. We partition the data into 60% training, 20% validation, and 20% test splits. |
| Hardware Specification | No | The paper states training was done using |
| Software Dependencies | No | The paper mentions using SGD and Adam optimizers, Wide Residual Networks, MLPMixer, and ResNet34 models, but does not specify software dependencies with version numbers (e.g., PyTorch 1.9, Python 3.8). |
| Experiment Setup | Yes | We use SGD with a momentum of 0.9, weight decay 5e-4, and initial learning rate of 0.1. We further use cosine annealing learning rate schedule. We train this model with Adam optimization algorithm with a learning rate of 0.001, weight decay of 5e-4. We further use cosine annealing learning rate schedule with a warm-up period of 5 epochs. |
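
The split ratios and learning-rate schedules quoted in the Dataset Splits and Experiment Setup rows can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation. The helper names `split_indices` and `cosine_lr`, the random seed, the dataset size, the total epoch count, the linear shape of the warm-up, and the decay to zero are all assumptions made for illustration; the table does not state them.

```python
import math
import random


def split_indices(n, fractions, seed=0):
    """Shuffle range(n) and cut it into consecutive parts sized by `fractions`.

    Hypothetical helper: the seed and the shuffling strategy are assumptions.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    parts, start = [], 0
    for i, frac in enumerate(fractions):
        # The last part takes the remainder so that no index is dropped.
        end = n if i == len(fractions) - 1 else start + round(frac * n)
        parts.append(indices[start:end])
        start = end
    return parts


def cosine_lr(epoch, total_epochs, base_lr, warmup_epochs=0):
    """Learning rate at a 0-indexed epoch: linear warm-up, then cosine decay.

    Hypothetical helper: linear warm-up and decay to zero are assumptions.
    """
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# 90%-10% train/validation partition of a training split (size is illustrative).
train_idx, val_idx = split_indices(1000, (0.9, 0.1))

# 60%/20%/20% train/validation/test partition.
train3, val3, test3 = split_indices(1000, (0.6, 0.2, 0.2))

# SGD-style schedule: initial learning rate 0.1, no warm-up (200 epochs assumed).
sgd_schedule = [cosine_lr(e, 200, 0.1) for e in range(200)]

# Adam-style schedule: learning rate 0.001 with a 5-epoch warm-up (100 epochs assumed).
adam_schedule = [cosine_lr(e, 100, 0.001, warmup_epochs=5) for e in range(100)]
```

In a deep learning framework, the optimizer's own weight-decay and momentum arguments would carry the remaining quoted values (momentum 0.9, weight decay 5e-4); they are omitted here because the table records no software dependencies.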