Uncalibrated Models Can Improve Human-AI Collaboration
Authors: Kailas Vodrahalli, Tobias Gerstenberg, James Y. Zou
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we present an initial exploration that suggests showing AI models as more confident than they actually are, even when the original AI is well-calibrated, can improve human-AI performance (measured as the accuracy and confidence of the human's final prediction after seeing the AI advice). We first train a model to predict human incorporation of AI advice using data from thousands of human-AI interactions. This enables us to explicitly estimate how to transform the AI's prediction confidence, making the AI uncalibrated, in order to improve the final human prediction. We empirically validate our results across four different tasks dealing with images, text and tabular data involving hundreds of human participants. We further support our findings with simulation analysis. (A toy illustration of such a confidence transform is sketched below the table.) |
| Researcher Affiliation | Academia | Kailas Vodrahalli, Stanford University, kailasv@stanford.edu; Tobias Gerstenberg, Stanford University, gerstenberg@stanford.edu; James Zou, Stanford University, jamesz@stanford.edu |
| Pseudocode | No | The paper describes the optimization procedure and human behavior model verbally and with diagrams (e.g., Figure 4), but it does not include any dedicated pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | We include instructions and materials to reproduce experimental results in the Supplement. |
| Open Datasets | Yes | We collected data from crowd-sourced participants on a diverse set of tasks that all have the same basic design. ... The data collected is summarized in Table 1. This data is publicly available at https://github.com/kailas-v/human-ai-interactions. ... Sarcasm dataset (Text data) This dataset is a subset of the Reddit sarcasm dataset [10] ... Census dataset (Tabular data) This dataset comes from US census data [32]. |
| Dataset Splits | No | The paper mentions using validation loss: "We trained the network using binary cross-entropy loss with an Adam optimizer [11] and stopped training when validation loss increased, as is standard." However, it does not specify the exact percentages or sample counts for the training, validation, or test splits, nor does it cite predefined splits in the main text. |
| Hardware Specification | No | The paper states: "Compute resources were minimal (i.e., could be run on a single core laptop in under a minute)". This indicates the type of machine, but does not provide specific hardware details such as GPU/CPU models, memory, or cloud provider names. |
| Software Dependencies | No | The paper mentions using an "Adam optimizer [11]" and "binary cross-entropy loss" and "stochastic gradient descent". However, it does not provide specific version numbers for any software libraries (e.g., Python, TensorFlow, PyTorch, scikit-learn) or other ancillary software components used for the experiments. |
| Experiment Setup | Yes | We use a 3 layer neural network with ReLU non-linearities for the activation model. We trained the network using binary cross-entropy loss with an Adam optimizer [11] and stopped training when validation loss increased, as is standard. ... Our integration model is similar to our activation model, but optimized for a different function. In particular, we use the same input features, model architecture, and training procedure. The integration model outputs a prediction for sign(r1)(r2 - r1)... The training minimizes the RMSE between the predicted change and the actual change. (A minimal PyTorch sketch of this setup appears after the table.) |
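
The Experiment Setup row above specifies a 3-layer ReLU network trained with binary cross-entropy, Adam, and early stopping when validation loss increases. The sketch below is a minimal PyTorch rendering of that description; the input dimension, hidden widths, learning rate, batch handling, and epoch count are illustrative assumptions, not values reported in the paper.

```python
# Minimal sketch of the activation-model setup quoted above (assumed hyperparameters).
import torch
import torch.nn as nn

class ActivationModel(nn.Module):
    """3-layer MLP with ReLU non-linearities, as described in the paper."""
    def __init__(self, in_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # logit for whether the human incorporates the AI advice
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def train(model, train_loader, val_loader, epochs: int = 100, lr: float = 1e-3):
    """Binary cross-entropy + Adam, stopping once validation loss increases."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    best_val = float("inf")
    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            opt.zero_grad()
            loss_fn(model(x), y.float()).backward()
            opt.step()
        # Early stopping on validation loss, per the paper's description.
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(x), y.float()).item() for x, y in val_loader)
        if val > best_val:
            break
        best_val = val
    return model
```

Per the quoted excerpt, the integration model reuses the same input features, architecture, and training procedure but regresses sign(r1)(r2 - r1) and minimizes RMSE, so the only changes to this sketch would be the target variable and swapping the loss for a squared-error objective.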
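
The Research Type row quotes the paper's high-level recipe: learn how humans incorporate AI advice, then transform the AI's displayed confidence (making it uncalibrated) so the final human prediction improves. The snippet below is only a toy illustration of one simple family of "uncalibrating" transforms, an odds-sharpening rule with a hypothetical parameter alpha; the paper instead learns the transform from its fitted human-behavior model, so this functional form is an assumption for illustration, not the authors' method.

```python
# Toy illustration of an "uncalibrating" confidence transform (not the paper's learned transform).
# A calibrated probability p is pushed toward 0 or 1 by sharpening the odds;
# alpha > 1 makes the displayed AI appear more confident than it actually is.

def sharpen_confidence(p: float, alpha: float = 2.0) -> float:
    """Map a calibrated probability p in (0, 1) to an over-confident displayed probability."""
    odds = (p / (1.0 - p)) ** alpha   # sharpen the odds ratio
    return odds / (1.0 + odds)        # convert back to a probability

if __name__ == "__main__":
    for p in (0.55, 0.7, 0.9):
        print(f"calibrated {p:.2f} -> displayed {sharpen_confidence(p):.2f}")
```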