Towards Human-AI Complementarity with Prediction Sets

Authors: Giovanni De Toni, Nastaran Okati, Suhas Thejaswi, Eleni Straitouri, Manuel Gomez-Rodriguez

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response (excerpt from the paper)
Research Type | Experimental | "Further, using a simulation study with both synthetic and real expert predictions, we demonstrate that, in practice, our greedy algorithm finds near-optimal prediction sets offering greater performance than conformal prediction."
Researcher Affiliation | Academia | "Giovanni De Toni, Fondazione Bruno Kessler & University of Trento, Trento, Italy, giovanni.detoni@unitn.it; Nastaran Okati, Max Planck Institute for Software Systems, Kaiserslautern, Germany, nastaran@mpi-sws.org; Suhas Thejaswi, Max Planck Institute for Software Systems, Kaiserslautern, Germany, thejaswi@mpi-sws.org; Eleni Straitouri, Max Planck Institute for Software Systems, Kaiserslautern, Germany, estraitouri@mpi-sws.org; Manuel Gomez-Rodriguez, Max Planck Institute for Software Systems, Kaiserslautern, Germany, manuelgr@mpi-sws.org"
Pseudocode | Yes | "Algorithm 1: Greedy algorithm. Input: label set Y, features x, classifier f, confusion matrix C. Output: prediction set S." (A hedged Python sketch of this greedy construction appears after the table.)
Open Source Code | Yes | "We have released an open-source implementation of our greedy algorithm as well as the code and data used in our experiments at https://github.com/Networks-Learning/towards-human-ai-complementarity-predictions-sets."
Open Datasets | Yes | "We experiment with the ImageNet-16H dataset [7], which was created using 1,200 natural images from the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 dataset [55]."
Dataset Splits | Yes | "For each classification task, we generate 19,000 samples, which we split into a training set (16,000 samples), a calibration set (1,000 samples), a validation set (1,000 samples) and a test set (1,000 samples)." (See the synthetic data-generation sketch after the table.)
Hardware Specification | Yes | "We run the experiment on a Linux machine equipped with an Intel Xeon(R) Gold 6252N CPU, with 96 cores and 1024 GB of RAM."
Software Dependencies | No | "The code infrastructure was written using Python 3.8 and the standard set of scientific open-source libraries (e.g., numpy, pandas, scikit-learn, etc.)."
Experiment Setup | Yes | "We employ the make_classification utility function of scikit-learn to generate the various prediction tasks. It is a convenient method to generate L-class classification tasks by varying several parameters such as the task difficulty, the number of labels and the number of informative features. In Section 5, we set the number of features to 20, the number of redundant features to 0 and the number of informative features to d = 4 (for an L = 10 label classification task). We assign a balanced proportion of samples to each class. We control the task difficulty by choosing the class_sep parameter, which represents the length of the sides of the hypercubes, thus indicating how far apart the various classes are. A smaller class_sep implies a more difficult classification task. We use the deep neural network classifier VGG-19 [56] after 10 epochs of fine-tuning, as provided by Steyvers et al. [7]. Further, we randomly split the images (and expert predictions) in each group into two disjoint subsets, a calibration set (800 images) and a test set (400 images). ... We use the calibration set to (i) calibrate the (softmax) outputs of VGG-19 using top-k-label calibration with k = 5, (ii) estimate the confusion matrix C that parameterizes the mixture of MNLs used to model the simulated human expert, and (iii) calculate the quantile q̂_α used by conformal prediction. For RAPS and SAPS, we run the procedure outlined in Appendix E of Angelopoulos et al. [49] to optimize the additional hyperparameters, k_reg and λ_RAPS for RAPS, and λ_SAPS for SAPS, using the validation set." (See the data-generation and conformal-quantile sketches after the table.)
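
For the pseudocode row above, the following is a minimal Python sketch of a greedy prediction-set construction in the spirit of Algorithm 1. It is an illustration under assumptions, not the authors' implementation: expert_success is a hypothetical MNL-style estimate of the probability that an expert, modeled through the confusion matrix C and shown only the labels in S, picks the true label; the released repository contains the actual objective and algorithm.

```python
import numpy as np

def expert_success(S, probs, C):
    """Hypothetical MNL-style estimate of the probability that the expert
    picks the true label when offered the candidate set S.  The true label
    is weighted by the classifier's softmax probs; given true label y, the
    expert chooses y' in S with probability proportional to C[y, y']."""
    S = list(S)
    total = 0.0
    for y in S:                      # the expert can only be right if y is offered
        denom = C[y, S].sum()
        if denom > 0:
            total += probs[y] * C[y, y] / denom
    return total

def greedy_prediction_set(probs, C):
    """Greedily grow a prediction set: at each step add the label that most
    increases the estimated expert success probability; stop when no label helps."""
    n_labels = len(probs)
    S = {int(np.argmax(probs))}      # start from the classifier's top-1 label
    best = expert_success(S, probs, C)
    improved = True
    while improved:
        improved, best_y = False, None
        for y in set(range(n_labels)) - S:
            value = expert_success(S | {y}, probs, C)
            if value > best:
                best, best_y, improved = value, y, True
        if improved:
            S.add(best_y)
    return S
```

Here probs would be the calibrated softmax output of the classifier for one sample and C the confusion matrix estimated on the calibration set; both names are placeholders for this sketch.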
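The synthetic-data rows (dataset splits and experiment setup) translate roughly into the scikit-learn calls below. Only the quoted values (19,000 samples, 20 features, 4 informative, 0 redundant, 10 classes, balanced classes, 16,000/1,000/1,000/1,000 split) come from the paper; the random seed, the class_sep value and n_clusters_per_class are assumptions made so the snippet runs.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# L = 10 classes, 20 features (d = 4 informative, 0 redundant), balanced classes.
# class_sep controls difficulty: smaller values mean a harder task.
X, y = make_classification(
    n_samples=19_000,
    n_features=20,
    n_informative=4,
    n_redundant=0,
    n_classes=10,
    n_clusters_per_class=1,   # assumed, so that 10 classes fit in the 2**4 hypercube corners
    class_sep=1.0,            # assumed value; the paper varies this parameter
    random_state=0,           # assumed seed
)

# Split into training (16,000), calibration (1,000), validation (1,000) and test (1,000).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=16_000, random_state=0)
X_cal, X_rest, y_cal, y_rest = train_test_split(X_rest, y_rest, train_size=1_000, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, train_size=1_000, random_state=0)
```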
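Step (iii) in the experiment-setup row, computing the quantile q̂_α for the conformal baseline, follows the standard split-conformal recipe: take the ⌈(n+1)(1−α)⌉/n empirical quantile of the calibration non-conformity scores. The sketch below uses 1 minus the true-class softmax probability as the score and α = 0.1 as the coverage level; both are common choices assumed here rather than taken from the paper.

```python
import numpy as np

def conformal_quantile(softmax_cal, y_cal, alpha=0.1):
    """Split-conformal threshold q_hat: the ceil((n + 1) * (1 - alpha)) / n
    empirical quantile of the calibration scores 1 - softmax[true class]."""
    n = len(y_cal)
    scores = 1.0 - softmax_cal[np.arange(n), y_cal]
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, level, method="higher")

def conformal_set(softmax_row, q_hat):
    """Prediction set for one test sample: all labels whose score is below the threshold."""
    return np.where(1.0 - softmax_row <= q_hat)[0]
```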