Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
On the Utility of Prediction Sets in Human-AI Teams
Authors: Varun Babbar, Umang Bhatt, Adrian Weller
IJCAI 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation on human subjects finds that set-valued predictions positively impact experts. However, we notice that the prediction sets provided by CP can be very large, which leads to unhelpful AI assistants. To mitigate this, we introduce D-CP, a method to perform CP on some examples and defer the rest to experts. We prove that D-CP can reduce the prediction set size of non-deferred examples. We show how D-CP performs in quantitative and human subject experiments (n=120). |
| Researcher Affiliation | Academia | Varun Babbar¹, Umang Bhatt¹,², Adrian Weller¹,² (¹University of Cambridge; ²The Alan Turing Institute) |
| Pseudocode | Yes | Algorithm 1 General D-CP |
| Open Source Code | Yes | Our code is hosted at https://github.com/cambridge-mlg/d-cp. |
| Open Datasets | Yes | Our first study focuses on establishing the value of set-valued predictions. For our experiments, we focus on one particular CP scheme called Regularised Adaptive Prediction Sets (RAPS) [Angelopoulos et al., 2020]. We recruit 30 participants on Prolific, paying them at a rate of 10 per hour prorated, and divide them into 2 equal groups. The first group is shown 18 images from the CIFAR-100 dataset alongside the model's most probable prediction (Top-1). |
| Dataset Splits | Yes | This requires an additional calibration dataset D_cal = {(X_i, Y_i)}_{i=1}^n drawn from the same distribution as the training and validation sets. After training a classifier on a training dataset, we can use this calibration dataset to choose the α quantile threshold τ_cal. |
| Hardware Specification | No | No specific hardware (e.g., GPU models, CPU types) used for running experiments was mentioned. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9) were mentioned. |
| Experiment Setup | Yes | We train a WideResNet [Zagoruyko and Komodakis, 2016] classifier m_θ(x) : X → Y on CIFAR-10H and CIFAR-100 for 5 and 10 epochs respectively, using the learning rate schedule in [Mozannar and Sontag, 2020]. |
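The calibration step quoted in the Dataset Splits row follows the standard split-conformal recipe: compute nonconformity scores on a held-out calibration set, then take the appropriate empirical quantile as the threshold τ_cal. The sketch below illustrates that recipe with the simplest score, 1 − p(y|x); the paper's RAPS scheme adds a rank-based regularisation on top of this, which is not reproduced here. Function names are illustrative, not from the paper's code.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))/n empirical
    quantile of the calibration nonconformity scores."""
    n = len(cal_scores)
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(cal_scores, q_level, method="higher")

def prediction_set(softmax_probs, tau_cal):
    """Include every class whose nonconformity score (here 1 - p_y)
    falls at or below the calibrated threshold."""
    scores = 1.0 - softmax_probs
    return np.where(scores <= tau_cal)[0]
```

With this construction, the returned set contains the true label with probability at least 1 − α over fresh exchangeable data, which is the coverage guarantee the paper relies on.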
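The pseudocode row references "Algorithm 1 General D-CP", in which some examples receive a conformal prediction set while others are deferred to the human expert. The paper learns its deferral rule jointly with the classifier; as a simplified stand-in, the hypothetical sketch below defers purely on prediction-set size, which captures only the interface behaviour (small set shown, large set deferred), not the paper's actual algorithm.

```python
import numpy as np

def d_cp_decision(softmax_probs, tau_cal, max_set_size=3):
    """Simplified deferral rule: show the conformal set when it is small
    enough to be useful, otherwise defer the example to the expert.
    `max_set_size` is an illustrative budget, not a value from the paper."""
    scores = 1.0 - softmax_probs
    pred_set = np.where(scores <= tau_cal)[0]
    if len(pred_set) <= max_set_size:
        return "predict", pred_set
    return "defer", None
```

Restricting conformal sets to the non-deferred examples is what lets D-CP shrink the sets the human actually sees, which is the effect the paper's n=120 human-subject study measures.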