Learning Optimal Conformal Classifiers

Authors: David Stutz, Krishnamurthy Dj Dvijotham, Ali Taylan Cemgil, Arnaud Doucet

ICLR 2022

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "In experiments on several datasets, we show ConfTr can influence how inefficiency is distributed across classes, or guide the composition of confidence sets in terms of the included classes, while retaining the guarantees offered by CP." |
| Researcher Affiliation | Collaboration | David Stutz¹,², Krishnamurthy (Dj) Dvijotham¹, Ali Taylan Cemgil¹, Arnaud Doucet¹ (¹DeepMind; ²Max Planck Institute for Informatics, Saarland Informatics Campus) |
| Pseudocode | Yes | "Algorithm 1: Smooth CP and conformal training (ConfTr). Top left: at test time, for THR, PREDICT computes the conformity scores Eθ(x, k) for each k ∈ [K] and constructs the confidence sets Cθ(x; τ) by thresholding with τ. CALIBRATE determines the threshold τ as the α(1 + 1/n)-quantile of the conformity scores w.r.t. the true classes y_i on a calibration set {(x_i, y_i)} of size n := \|I_cal\|. THR and APS use different conformity scores. Right and bottom left: ConfTr calibrates on one part of each mini-batch, B_cal. Thereby, we obtain guaranteed coverage on the other part, B_pred (in expectation across batches). Then, the inefficiency on B_pred is minimized to update the model parameters θ. Smooth implementations of calibration and prediction are used." (A hedged code sketch of these steps follows the table.) |
| Open Source Code | Yes | "App. P (specifically Alg. B) lists the corresponding Python implementation of these key components." |
| Open Datasets | Yes | "We consider Camelyon2016 (Bejnordi et al., 2017), German Credit (Dua & Graff, 2017), Wine Quality (Cortez et al., 2009), MNIST (LeCun et al., 1998), EMNIST (Cohen et al., 2017), Fashion-MNIST (Xiao et al., 2017) and CIFAR (Krizhevsky, 2009) with a fixed split of training, calibration and test examples. Tab. A summarizes key statistics of the used datasets, which we elaborate on in the following. Except for Camelyon, all datasets are provided by TensorFlow (Abadi et al., 2015)." |
| Dataset Splits | Yes | "Tab. A reports the training/calibration/test splits of all used datasets and Tab. B the hyper-parameters used for ConfTr." |
| Hardware Specification | No | The paper mentions using linear models, MLPs, and ResNets, and that the models are implemented in JAX and Haiku, but it does not specify any particular hardware (e.g., GPU models, CPU types, or cloud instance details) used for running the experiments. |
| Software Dependencies | Yes | "Models and training are implemented in JAX (Bradbury et al., 2018) and the ResNets follow the implementation and architecture provided by Haiku (Hennigan et al., 2020). ... Alg. B presents code in Python, using JAX (Bradbury et al., 2018), Haiku (Hennigan et al., 2020) and Optax (Hessel et al., 2020)." |
| Experiment Setup | Yes | "The final hyper-parameters selected for ConfTr (for THR at test time) on all datasets are summarized in Tab. B. These were obtained using grid search over the following hyper-parameters: batch size in {1000, 500, 100}; learning rate in {0.05, 0.01, 0.005}; temperature T ∈ {0.01, 0.1, 0.5, 1}; size weight λ ∈ {0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10} (cf. Eq. (1), right); and κ ∈ {0, 1} (cf. Eq. (3))." (A sketch enumerating this grid appears after the code sketch below.) |
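
As a reading aid for the Pseudocode row above, here is a minimal sketch of THR calibration, (smooth) prediction, and the ConfTr size loss in JAX. It is not the authors' implementation: the paper uses a smooth, differentiable quantile, whereas `jnp.quantile` serves as a simple stand-in here; the equal B_cal/B_pred split and all function names are illustrative assumptions.

```python
# Minimal sketch of smooth THR calibration/prediction and the ConfTr size
# loss. NOT the authors' code; jnp.quantile stands in for their smooth
# quantile, and the half/half batch split is an assumption.
import jax
import jax.numpy as jnp


def calibrate_thr(scores, labels, alpha=0.01):
    """Threshold tau: the alpha*(1 + 1/n)-quantile of the conformity
    scores E(x_i, y_i) at the true labels on the calibration examples."""
    n = labels.shape[0]
    true_scores = scores[jnp.arange(n), labels]          # E(x_i, y_i)
    return jnp.quantile(true_scores, alpha * (1.0 + 1.0 / n))


def predict_thr(scores, tau):
    """Hard confidence sets C(x; tau) = {k : E(x, k) >= tau} as 0/1 masks."""
    return (scores >= tau).astype(jnp.float32)


def smooth_predict_thr(scores, tau, temperature=0.1):
    """Smooth set membership via a sigmoid on (E(x, k) - tau) / T."""
    return jax.nn.sigmoid((scores - tau) / temperature)


def conftr_size_loss(scores, labels, alpha=0.01, temperature=0.1, kappa=1.0):
    """ConfTr step on one mini-batch: calibrate on B_cal (first half),
    then minimize the smooth size of the confidence sets on B_pred."""
    n = scores.shape[0] // 2
    tau = calibrate_thr(scores[:n], labels[:n], alpha)   # smooth in the paper
    soft_sets = smooth_predict_thr(scores[n:], tau, temperature)
    sizes = jnp.sum(soft_sets, axis=-1)                  # expected set sizes
    return jnp.mean(jnp.maximum(sizes - kappa, 0.0))     # size loss with kappa
```

Calibrating on B_cal and evaluating the smooth set sizes on B_pred mirrors how CP is applied at test time, which is the core idea behind training through the conformal wrapper.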
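To make the scale of the search in the Experiment Setup row concrete, the quoted grid can be enumerated as follows (a sketch; the per-dataset selection criterion behind Tab. B is not reproduced here):

```python
import itertools

# Hyper-parameter grid quoted above (cf. Eq. (1) and Eq. (3) in the paper).
grid = {
    "batch_size": [1000, 500, 100],
    "learning_rate": [0.05, 0.01, 0.005],
    "temperature": [0.01, 0.1, 0.5, 1],
    "size_weight": [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10],
    "kappa": [0, 1],
}
configs = [dict(zip(grid, vals)) for vals in itertools.product(*grid.values())]
print(len(configs))  # 3 * 3 * 4 * 11 * 2 = 792 candidate configurations
```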