Uncalibrated Models Can Improve Human-AI Collaboration

Authors: Kailas Vodrahalli, Tobias Gerstenberg, James Y. Zou

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we present an initial exploration that suggests showing AI models as more confident than they actually are, even when the original AI is well-calibrated, can improve human-AI performance (measured as the accuracy and confidence of the human's final prediction after seeing the AI advice). We first train a model to predict human incorporation of AI advice using data from thousands of human-AI interactions. This enables us to explicitly estimate how to transform the AI's prediction confidence, making the AI uncalibrated, in order to improve the final human prediction. We empirically validate our results across four different tasks dealing with images, text and tabular data involving hundreds of human participants. We further support our findings with simulation analysis.
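The paper learns its confidence transformation from human-interaction data; a minimal sketch of the underlying idea (deliberately miscalibrating a calibrated probability so the AI's advice appears more confident) could look like the following. The odds-exponent transform and the `alpha` parameter are assumptions for illustration, not the paper's learned model.

```python
# Illustrative sketch only: inflate a calibrated probability p toward higher
# confidence before showing it to a human. In the paper, the transformation is
# instead chosen by optimizing over a learned model of human behavior.
import numpy as np

def inflate_confidence(p, alpha=0.5):
    """Push a calibrated probability p in (0, 1) away from 0.5.

    alpha < 1 sharpens the advice (more confident); alpha = 1 leaves it unchanged.
    `alpha` is a hypothetical knob for this sketch.
    """
    odds = p / (1.0 - p)            # work in odds so the transform is symmetric around 0.5
    new_odds = odds ** (1.0 / alpha)
    return new_odds / (1.0 + new_odds)

calibrated = np.array([0.55, 0.7, 0.9])
print(inflate_confidence(calibrated))  # each value moves further from 0.5, i.e. looks more confident
```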
Researcher Affiliation | Academia | Kailas Vodrahalli, Stanford University, kailasv@stanford.edu; Tobias Gerstenberg, Stanford University, gerstenberg@stanford.edu; James Zou, Stanford University, jamesz@stanford.edu
Pseudocode | No | The paper describes the optimization procedure and human behavior model verbally and with diagrams (e.g., Figure 4), but it does not include any dedicated pseudocode blocks or algorithm listings.
Open Source Code | Yes | We include instructions and materials to reproduce experimental results in the Supplement.
Open Datasets | Yes | We collected data from crowd-sourced participants on a diverse set of tasks that all have the same basic design. ... The data collected is summarized in Table 1. This data is publicly available at https://github.com/kailas-v/human-ai-interactions. ... Sarcasm dataset (Text data): This dataset is a subset of the Reddit sarcasm dataset [10]. ... Census dataset (Tabular data): This dataset comes from US census data [32].
Dataset Splits | No | The paper mentions using validation loss: "We trained the network using binary cross-entropy loss with an Adam optimizer [11] and stopped training when validation loss increased, as is standard." However, it does not specify the exact percentages or sample counts for the training, validation, or test splits, nor does it cite predefined splits in the main text.
Hardware Specification | No | The paper states: "Compute resources were minimal (i.e., could be run on a single core laptop in under a minute)". This indicates the type of machine, but does not provide specific hardware details such as GPU/CPU models, memory, or cloud provider names.
Software Dependencies | No | The paper mentions using an "Adam optimizer [11]", "binary cross-entropy loss", and "stochastic gradient descent". However, it does not provide version numbers for any software libraries (e.g., Python, TensorFlow, PyTorch, scikit-learn) or other ancillary software components used for the experiments.
Experiment Setup | Yes | We use a 3-layer neural network with ReLU non-linearities for the activation model. We trained the network using binary cross-entropy loss with an Adam optimizer [11] and stopped training when validation loss increased, as is standard. ... Our integration model is similar to our activation model, but optimized for a different function. In particular, we use the same input features, model architecture, and training procedure. The integration model outputs a prediction for sign(r1)(r2 − r1)... The training minimizes the RMSE between the predicted change and the actual change.
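A minimal PyTorch sketch consistent with this description (3-layer ReLU network, binary cross-entropy, Adam, stop when validation loss increases) might look as follows. The input dimension, hidden width, learning rate, and epoch budget are assumptions for illustration; the paper does not report them in the excerpt above.

```python
# Sketch of the activation model and its training loop under the assumptions
# stated in the lead-in; not the authors' released code.
import torch
import torch.nn as nn

def make_activation_model(n_features, hidden=32):
    # Three linear layers with ReLU non-linearities; the final layer outputs
    # the logit of P(human incorporates the AI advice).
    return nn.Sequential(
        nn.Linear(n_features, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 1),
    )

def train(model, train_x, train_y, val_x, val_y, lr=1e-3, max_epochs=500):
    loss_fn = nn.BCEWithLogitsLoss()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    best_val = float("inf")
    for _ in range(max_epochs):
        model.train()
        opt.zero_grad()
        loss = loss_fn(model(train_x).squeeze(-1), train_y)
        loss.backward()
        opt.step()

        model.eval()
        with torch.no_grad():
            val_loss = loss_fn(model(val_x).squeeze(-1), val_y).item()
        if val_loss > best_val:
            break  # stop training when validation loss increases
        best_val = val_loss
    return model
```

Under the same assumptions, the integration model would reuse this architecture and training loop, but with the single output treated as an unbounded regression target for sign(r1)(r2 − r1) and `nn.MSELoss` in place of the binary cross-entropy objective.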