Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
On the Utility of Prediction Sets in Human-AI Teams
Authors: Varun Babbar, Umang Bhatt, Adrian Weller
IJCAI 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation on human subjects finds that set-valued predictions positively impact experts. However, we notice that the prediction sets provided by CP can be very large, which leads to unhelpful AI assistants. To mitigate this, we introduce D-CP, a method to perform CP on some examples and defer the rest to experts. We prove that D-CP can reduce the prediction set size of non-deferred examples. We show how D-CP performs in quantitative and human subject experiments (n=120). |
| Researcher Affiliation | Academia | Varun Babbar¹, Umang Bhatt¹,², Adrian Weller¹,² (¹University of Cambridge; ²The Alan Turing Institute) |
| Pseudocode | Yes | Algorithm 1 General D-CP |
| Open Source Code | Yes | Our code is hosted at https://github.com/cambridge-mlg/d-cp. |
| Open Datasets | Yes | Our first study focuses on establishing the value of set-valued predictions. For our experiments, we focus on one particular CP scheme called Regularised Adaptive Prediction Sets (RAPS) [Angelopoulos et al., 2020]. We recruit 30 participants on Prolific, paying them at a rate of 10 per hour prorated, and divide them into 2 equal groups. The first group is shown 18 images from the CIFAR-100 dataset alongside the model's most probable prediction (Top-1). |
| Dataset Splits | Yes | This requires an additional calibration dataset D_cal = {(X_i, Y_i)}_{i=1}^n drawn from the same distribution as the training and validation sets. After training a classifier on a training dataset, we can use this calibration dataset to choose the α quantile threshold τ_cal. |
| Hardware Specification | No | No specific hardware (e.g., GPU models, CPU types) used for running experiments was mentioned. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9) were mentioned. |
| Experiment Setup | Yes | We train a WideResNet [Zagoruyko and Komodakis, 2016] classifier m_θ(x) : X → Y on CIFAR-10H and CIFAR-100 for 5 and 10 epochs respectively, using the learning rate schedule in [Mozannar and Sontag, 2020]. |
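The calibration step quoted in the Dataset Splits row follows the standard split-conformal recipe: compute nonconformity scores on a held-out calibration set, then take the appropriate empirical quantile as the threshold τ_cal. The sketch below illustrates that recipe with the simplest score, 1 − p(y|x); the paper's RAPS scheme adds a rank-based regularisation on top of this, which is not reproduced here. Function names are illustrative, not from the paper's code.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))/n empirical
    quantile of the calibration nonconformity scores."""
    n = len(cal_scores)
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(cal_scores, q_level, method="higher")

def prediction_set(softmax_probs, tau_cal):
    """Include every class whose nonconformity score (here 1 - p_y)
    falls at or below the calibrated threshold."""
    scores = 1.0 - softmax_probs
    return np.where(scores <= tau_cal)[0]
```

With this construction, the returned set contains the true label with probability at least 1 − α over fresh exchangeable data, which is the coverage guarantee the paper relies on.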
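The pseudocode row references "Algorithm 1 General D-CP", in which some examples receive a conformal prediction set while others are deferred to the human expert. The paper learns its deferral rule jointly with the classifier; as a simplified stand-in, the hypothetical sketch below defers purely on prediction-set size, which captures only the interface behaviour (small set shown, large set deferred), not the paper's actual algorithm.

```python
import numpy as np

def d_cp_decision(softmax_probs, tau_cal, max_set_size=3):
    """Simplified deferral rule: show the conformal set when it is small
    enough to be useful, otherwise defer the example to the expert.
    `max_set_size` is an illustrative budget, not a value from the paper."""
    scores = 1.0 - softmax_probs
    pred_set = np.where(scores <= tau_cal)[0]
    if len(pred_set) <= max_set_size:
        return "predict", pred_set
    return "defer", None
```

Restricting conformal sets to the non-deferred examples is what lets D-CP shrink the sets the human actually sees, which is the effect the paper's n=120 human-subject study measures.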