On the Utility of Prediction Sets in Human-AI Teams
Authors: Varun Babbar, Umang Bhatt, Adrian Weller
IJCAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation on human subjects finds that set-valued predictions positively impact experts. However, we notice that the prediction sets provided by CP can be very large, which leads to unhelpful AI assistants. To mitigate this, we introduce D-CP, a method to perform CP on some examples and defer the rest to experts. We prove that D-CP can reduce the prediction set size of non-deferred examples. We show how D-CP performs in quantitative experiments and in human subject experiments (n=120). |
| Researcher Affiliation | Academia | Varun Babbar (1), Umang Bhatt (1,2), Adrian Weller (1,2); (1) University of Cambridge, (2) The Alan Turing Institute; {vb395, usb20, aw665}@cam.ac.uk |
| Pseudocode | Yes | Algorithm 1 General D-CP |
| Open Source Code | Yes | Our code is hosted at https://github.com/cambridge-mlg/d-cp. |
| Open Datasets | Yes | Our first study focuses on establishing the value of set-valued predictions. For our experiments, we focus on one particular CP scheme called Regularised Adaptive Prediction Sets (RAPS) [Angelopoulos et al., 2020]. We recruit 30 participants on Prolific, paying them at a rate of 10 per hour, prorated, and divide them into 2 equal groups. The first group is shown 18 images from the CIFAR-100 dataset alongside the model's most probable prediction (Top-1). |
| Dataset Splits | Yes | This requires an additional calibration dataset D_cal = {(X_i, Y_i)}_{i=1}^{n} drawn from the same distribution as the training and validation sets. After training a classifier on a training dataset, we can use this calibration dataset to choose the α-quantile threshold τ_cal. |
| Hardware Specification | No | No specific hardware (e.g., GPU models, CPU types) used for running experiments was mentioned. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9) were mentioned. |
| Experiment Setup | Yes | We train a WideResNet [Zagoruyko and Komodakis, 2016] classifier m_θ(x) : X → Y on CIFAR-10H and CIFAR-100 for 5 and 10 epochs respectively, using the learning rate schedule in [Mozannar and Sontag, 2020]. |
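The calibration step quoted above (choosing the α-quantile threshold τ_cal from a held-out calibration set, then forming prediction sets) can be sketched in a few lines. This is a minimal split-conformal sketch, not the paper's RAPS or D-CP implementation: the toy linear classifier, the 1 − softmax nonconformity score, and all variable names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a trained classifier m_theta: fixed random weights,
# softmax over 3 classes. A real experiment would use the trained model.
W = rng.normal(size=(5, 3))

def predict_proba(X):
    logits = X @ W
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Calibration data D_cal = {(X_i, Y_i)}_{i=1}^{n}, drawn from the same
# distribution as the training and validation sets (here: synthetic).
n = 500
X_cal = rng.normal(size=(n, 5))
y_cal = rng.integers(0, 3, size=n)

alpha = 0.1  # target miscoverage rate

# Nonconformity score: 1 - softmax probability assigned to the true class.
probs = predict_proba(X_cal)
scores = 1.0 - probs[np.arange(n), y_cal]

# alpha-quantile threshold tau_cal, with the usual finite-sample correction.
q = np.ceil((n + 1) * (1 - alpha)) / n
tau_cal = np.quantile(scores, q, method="higher")

def prediction_set(x):
    """All classes whose nonconformity score falls below tau_cal."""
    p = predict_proba(x.reshape(1, -1))[0]
    return [k for k in range(3) if 1.0 - p[k] <= tau_cal]
```

On fresh data from the same distribution, sets built this way cover the true label with probability at least 1 − α; the paper's observation is that such sets can grow large, which D-CP addresses by deferring some examples to the expert instead of conformalizing them.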