Uncalibrated Models Can Improve Human-AI Collaboration

Authors: Kailas Vodrahalli, Tobias Gerstenberg, James Y. Zou

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we present an initial exploration that suggests showing AI models as more confident than they actually are, even when the original AI is well-calibrated, can improve human-AI performance (measured as the accuracy and confidence of the human's final prediction after seeing the AI advice). We first train a model to predict human incorporation of AI advice using data from thousands of human-AI interactions. This enables us to explicitly estimate how to transform the AI's prediction confidence, making the AI uncalibrated, in order to improve the final human prediction. We empirically validate our results across four different tasks dealing with images, text and tabular data involving hundreds of human participants. We further support our findings with simulation analysis.
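The paper learns its confidence transformation from human-interaction data; a minimal sketch of the underlying idea (deliberately miscalibrating a calibrated probability so the AI's advice appears more confident) could look like the following. The odds-exponent transform and the `alpha` parameter are assumptions for illustration, not the paper's learned model.

```python
# Illustrative sketch only: inflate a calibrated probability p toward higher
# confidence before showing it to a human. In the paper, the transformation is
# instead chosen by optimizing over a learned model of human behavior.
import numpy as np

def inflate_confidence(p, alpha=0.5):
    """Push a calibrated probability p in (0, 1) away from 0.5.

    alpha < 1 sharpens the advice (more confident); alpha = 1 leaves it unchanged.
    `alpha` is a hypothetical knob for this sketch.
    """
    odds = p / (1.0 - p)            # work in odds so the transform is symmetric around 0.5
    new_odds = odds ** (1.0 / alpha)
    return new_odds / (1.0 + new_odds)

calibrated = np.array([0.55, 0.7, 0.9])
print(inflate_confidence(calibrated))  # each value moves further from 0.5, i.e. looks more confident
```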
Researcher Affiliation | Academia | Kailas Vodrahalli, Stanford University, kailasv@stanford.edu; Tobias Gerstenberg, Stanford University, gerstenberg@stanford.edu; James Zou, Stanford University, jamesz@stanford.edu
Pseudocode | No | The paper describes the optimization procedure and human behavior model verbally and with diagrams (e.g., Figure 4), but it does not include any dedicated pseudocode blocks or algorithm listings.
Open Source Code | Yes | We include instructions and materials to reproduce experimental results in the Supplement.
Open Datasets | Yes | We collected data from crowd-sourced participants on a diverse set of tasks that all have the same basic design. ... The data collected is summarized in Table 1. This data is publicly available at https://github.com/kailas-v/human-ai-interactions. ... Sarcasm dataset (Text data): This dataset is a subset of the Reddit sarcasm dataset [10]. ... Census dataset (Tabular data): This dataset comes from US census data [32].
Dataset Splits | No | The paper mentions using validation loss: "We trained the network using binary cross-entropy loss with an Adam optimizer [11] and stopped training when validation loss increased, as is standard." However, it does not specify the exact percentages or sample counts for the training, validation, or test splits, nor does it cite predefined splits in the main text.
Hardware Specification | No | The paper states: "Compute resources were minimal (i.e., could be run on a single core laptop in under a minute)". This indicates the type of machine, but does not provide specific hardware details such as GPU/CPU models, memory, or cloud provider names.
Software Dependencies | No | The paper mentions using an "Adam optimizer [11]", "binary cross-entropy loss", and "stochastic gradient descent". However, it does not provide version numbers for any software libraries (e.g., Python, TensorFlow, PyTorch, scikit-learn) or other ancillary software components used for the experiments.
Experiment Setup | Yes | We use a 3-layer neural network with ReLU non-linearities for the activation model. We trained the network using binary cross-entropy loss with an Adam optimizer [11] and stopped training when validation loss increased, as is standard. ... Our integration model is similar to our activation model, but optimized for a different function. In particular, we use the same input features, model architecture, and training procedure. The integration model outputs a prediction for sign(r1)(r2 − r1)... The training minimizes the RMSE between the predicted change and the actual change.
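A minimal PyTorch sketch consistent with this description (3-layer ReLU network, binary cross-entropy, Adam, stop when validation loss increases) might look as follows. The input dimension, hidden width, learning rate, and epoch budget are assumptions for illustration; the paper does not report them in the excerpt above.

```python
# Sketch of the activation model and its training loop under the assumptions
# stated in the lead-in; not the authors' released code.
import torch
import torch.nn as nn

def make_activation_model(n_features, hidden=32):
    # Three linear layers with ReLU non-linearities; the final layer outputs
    # the logit of P(human incorporates the AI advice).
    return nn.Sequential(
        nn.Linear(n_features, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 1),
    )

def train(model, train_x, train_y, val_x, val_y, lr=1e-3, max_epochs=500):
    loss_fn = nn.BCEWithLogitsLoss()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    best_val = float("inf")
    for _ in range(max_epochs):
        model.train()
        opt.zero_grad()
        loss = loss_fn(model(train_x).squeeze(-1), train_y)
        loss.backward()
        opt.step()

        model.eval()
        with torch.no_grad():
            val_loss = loss_fn(model(val_x).squeeze(-1), val_y).item()
        if val_loss > best_val:
            break  # stop training when validation loss increases
        best_val = val_loss
    return model
```

Under the same assumptions, the integration model would reuse this architecture and training loop, but with the single output treated as an unbounded regression target for sign(r1)(r2 − r1) and `nn.MSELoss` in place of the binary cross-entropy objective.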