Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff

Authors: Gagan Bansal, Besmira Nushi, Ece Kamar, Daniel S. Weld, Walter S. Lasecki, Eric Horvitz (pp. 2429-2437)

AAAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results on three high-stakes classification tasks show that current machine learning algorithms do not produce compatible updates. We propose a re-training objective to improve the compatibility of an update by penalizing new errors. The objective offers full leverage of the performance/compatibility tradeoff across different datasets, enabling more compatible yet accurate updates.
Researcher Affiliation | Collaboration | Gagan Bansal (1), Besmira Nushi (2), Ece Kamar (2), Daniel S. Weld (1), Walter S. Lasecki (3), Eric Horvitz (2) — (1) University of Washington, (2) Microsoft Research, (3) University of Michigan
Pseudocode | No | The paper defines mathematical equations for loss functions but does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | We introduce an open-source experimental platform for studying how people model the error boundary of an AI teammate in the presence of updates for an AI-advised decision-making task. The platform exposes important design factors (e.g., task complexity, reward, update type) to the experimenter. (Available at https://github.com/gagb/caja)
Open Datasets | Yes | Datasets. To investigate whether a tradeoff exists between performance and compatibility of an update, we simulate updates to classifiers for three domains: recidivism prediction (Will a convict commit another crime?) (Angwin et al. 2016), in-hospital mortality prediction (Will a patient die in the hospital?) (Johnson et al. 2016; Harutyunyan et al. 2017), and credit risk assessment (Will a borrower fail to pay back?).
Dataset Splits | No | The paper mentions training on specific numbers of examples (200 and 5,000) and refers generally to a 'validation set', but it does not give explicit train/validation/test split percentages, absolute counts for each set, or a split methodology that would allow reproduction beyond the training-set sizes.
Hardware Specification | No | The paper does not provide any specific hardware details such as GPU or CPU models, memory, or types of computing instances used for the experiments.
Software Dependencies | No | The paper mentions machine learning models (logistic regression, MLP) and loss functions, but does not provide version numbers for any software libraries, frameworks, or dependencies used in the implementation or experiments.
Experiment Setup | No | The paper describes varying the λc parameter in its reformulated objective, but does not provide specific hyperparameter values such as learning rates, batch sizes, epochs, or optimizer settings used to train the models in its experiments.
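
The Research Type and Experiment Setup rows above refer to a re-training objective that penalizes new errors, weighted by a λc parameter. The following is a minimal sketch, not the authors' released code: it assumes a PyTorch classifier and interprets "penalizing new errors" as an extra λc-weighted loss on examples the previous model h1 already classified correctly; the function and variable names (compatibility_loss, retrain_update, correct_old) are illustrative, not from the paper.

```python
# Sketch of compatibility-penalized retraining (assumed formulation, not the authors' code).
import torch
import torch.nn.functional as F

def compatibility_loss(logits_new, y, correct_old, lam_c):
    """Standard cross-entropy plus a lam_c-weighted penalty on examples the
    old model classified correctly, so introducing new errors there costs extra."""
    base = F.cross_entropy(logits_new, y, reduction="none")
    dissonance = correct_old.float() * base  # only where h1 was right
    return (base + lam_c * dissonance).mean()

def retrain_update(h2, h1, loader, lam_c=1.0, epochs=10, lr=1e-3):
    """Train the updated model h2 while discouraging errors on cases h1 got right."""
    opt = torch.optim.Adam(h2.parameters(), lr=lr)
    h1.eval()
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                correct_old = h1(x).argmax(dim=1).eq(y)  # where h1 is correct
            loss = compatibility_loss(h2(x), y, correct_old, lam_c)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return h2
```

Under this reading, sweeping lam_c traces the performance/compatibility tradeoff the paper reports: lam_c = 0 reduces to standard retraining, while larger values trade some raw accuracy of h2 for fewer new errors on examples h1 already handled correctly.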