Using AI Uncertainty Quantification to Improve Human Decision-Making

Authors: Laura Marusich, Jonathan Bakdash, Yan Zhou, Murat Kantarcioglu

ICML 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluated the impact on human decision-making for instance-level UQ, calibrated using a strict scoring rule, in two online behavioral experiments. In the first experiment, our results showed that UQ was beneficial for decision-making performance compared to only AI predictions. In the second experiment, we found UQ had generalizable benefits for decision-making across a variety of representations for probabilistic information. These results indicate that implementing high-quality, instance-level UQ for AI may improve decision-making with real systems compared to AI predictions alone.
Researcher Affiliation Collaboration ¹DEVCOM Army Research Laboratory; ²University of Texas at Dallas, Richardson, TX.
Pseudocode No The paper does not contain any sections or figures explicitly labeled as "Pseudocode" or "Algorithm," nor does it present structured code-like steps for a procedure.
Open Source Code Yes See supplementary material at https://osf.io/cb762/.
Open Datasets Yes We assessed our research questions using three different publicly-available and widely-used datasets: the Census, German Credit, and Student Performance datasets from the UCI Machine Learning Repository (Dua & Graff, 2017), described in more detail below.
Dataset Splits No The paper states: "Each dataset was split into training (70%) and test (30%) data sets." While it mentions train and test splits, it does not explicitly state a validation set split or provide details for one.
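The 70%/30% train/test split quoted above is simple to reproduce in spirit; a minimal sketch (our own code, not from the paper's supplementary material) of a seeded shuffle-and-split:

```python
import random

def train_test_split(rows, train_frac=0.7, seed=0):
    """Shuffle rows with a fixed seed, then cut at train_frac."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    cut = int(len(rows) * train_frac)
    train = [rows[i] for i in idx[:cut]]
    test = [rows[i] for i in idx[cut:]]
    return train, test

data = list(range(1000))
train, test = train_test_split(data)  # 700 training rows, 300 test rows
```

Note that without a stated random seed (which the paper does not report), exact split membership is not reproducible, only the 70/30 proportions.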
Hardware Specification Yes All classification tasks were completed on an Intel Xeon machine with a 2.30GHz CPU.
Software Dependencies No The paper mentions "jsPsych (De Leeuw, 2015)" and "Just Another Tool for Online Studies (JATOS) https://github.com/JATOS/JATOS" but does not specify their version numbers. It also mentions machine learning models (e.g., random forest) but not the specific libraries, or library versions, used to implement them.
Experiment Setup Yes In our study, we aim to provide predictive uncertainty quantification to human decision-makers and use the advantage of knowing the true labels in advance. Therefore, we simplify the problem as sampling predictive confidence from samples of x with a small random disturbance and verify the quality of the uncertainty estimate using a strictly proper scoring rule (Gneiting & Raftery, 2007) before showing it to the human. ... In this study, we set n = 100 and σ0 = 0.1. ... In the experiment, we let n = 100 and δ = 0.1 which provided sufficient statistical significance and constrained neighborhood choices. ... Each trial of this task included a description of an individual and a two-alternative forced choice for the classification of that individual. Each choice was correct on 50% of the trials, thus chance performance for human decision-making accuracy was 50%. ... After making a decision, participants then entered their confidence in that choice, on a Likert scale of 1 (No Confidence) to 5 (Full Confidence). ... for each participant, we randomly sampled 40 of those 50 instances for the block of test trials... completed 8 practice trials, followed by 40 test trials.
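The setup described above (sample n = 100 perturbed copies of an instance x with small Gaussian noise, average the model's outputs into a predictive confidence, then check calibration with a strictly proper scoring rule) can be sketched as follows. This is our own illustrative reconstruction, not the authors' code: `predictive_confidence`, `brier_score`, and the toy logistic classifier are all hypothetical stand-ins, and the Brier score is used here only as one example of a strictly proper scoring rule.

```python
import math
import random

def predictive_confidence(predict_proba, x, n=100, sigma=0.1, seed=0):
    """Estimate P(y=1 | x) by averaging the model's probability over
    n copies of x perturbed with Gaussian noise of std sigma."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x_pert = [xi + rng.gauss(0.0, sigma) for xi in x]
        total += predict_proba(x_pert)
    return total / n

def brier_score(confidences, labels):
    """Brier score, a strictly proper scoring rule (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(confidences, labels)) / len(labels)

# Toy stand-in classifier: logistic response on the first feature only.
toy_model = lambda x: 1.0 / (1.0 + math.exp(-4.0 * x[0]))

conf = predictive_confidence(toy_model, [0.5, -0.2])
score = brier_score([conf], [1])
```

In the experiments this confidence would be verified against the known true labels (the scoring rule check) before being shown to participants alongside the AI prediction.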