Teaching Humans When to Defer to a Classifier via Exemplars

Authors: Hussein Mozannar, Arvind Satyanarayan, David Sontag

AAAI 2022, pp. 5323-5331

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | For our main evaluation, we conduct experiments on Amazon Mechanical Turk on the task of passage-based question answering from HotpotQA (Yang et al. 2018). Crowdworkers first performed a teaching phase and were then tested on a randomly chosen subset of examples. Our results demonstrate the importance of teaching: around half of the participants who undertook the teaching phase were able to correctly determine the AI's region of error and had a resulting improved performance. We furthermore validate our method on a set of synthetic experiments.
Researcher Affiliation | Academia | Hussein Mozannar, Arvind Satyanarayan, David Sontag; Massachusetts Institute of Technology; mozannar@mit.edu
Pseudocode | No | The paper describes a greedy algorithm (GREEDY-SELECT) with steps and equations, but it is presented within the main text rather than in a clearly structured pseudocode or algorithm block. (A hedged sketch of the greedy pattern appears after this table.)
Open Source Code | No | The paper does not provide an explicit statement about releasing its source code, nor a link to a code repository for the methodology described.
Open Datasets | Yes | We rely on the HotpotQA dataset (Yang et al. 2018), collected by crowdsourcing based on Wikipedia articles. ... To complement our NLP-based experiments, we run a study on CIFAR-10 (Krizhevsky, Hinton et al. 2009), consisting of images from 10 classes.
Dataset Splits | Yes | We further remove yes/no questions from the dataset and only consider hard multi-hop questions from the train set of 14631 examples and the dev set of 6947 examples.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, memory) used to run its experiments.
Software Dependencies | No | The paper mentions software components such as RoBERTa embeddings, K-means, a Sentence-BERT model, and LIME, but it does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | No | The paper describes parameters for its simulated AI and human models (e.g., k_p = 15 and a vector of errors err_p where, for each i, err_p[i] is drawn i.i.d. from Beta(α_ai, β_ai); the human prior thresholds the human's probability of error at a constant ϵ), as well as the overall experimental design. However, it does not provide specific training hyperparameters such as learning rates, batch sizes, or number of epochs for any of the models used (e.g., the Wide ResNet). (A hedged simulation sketch appears after this table.)
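
Since GREEDY-SELECT is given only in prose, the following Python sketch shows the generic greedy subset-selection pattern that the description suggests. The function name, the marginal-gain callable `gain`, and the default budget of k = 15 exemplars are illustrative assumptions, not the paper's actual objective, which models how each exemplar shifts the learner's estimate of the AI's error region.

```python
import numpy as np

def greedy_select(candidates, gain, k=15):
    """Greedy exemplar selection: repeatedly add the candidate with the
    largest marginal gain. `gain` maps a list of selected exemplars to a
    score; it is a hypothetical stand-in for the paper's teaching
    objective, which is not reproduced in the quoted material."""
    selected = []
    remaining = list(candidates)
    for _ in range(min(k, len(remaining))):
        # Marginal gain of each remaining candidate given the current set.
        gains = [gain(selected + [c]) - gain(selected) for c in remaining]
        best = int(np.argmax(gains))
        selected.append(remaining.pop(best))
    return selected
```

If the objective is monotone submodular, this greedy pattern carries the classic (1 - 1/e) approximation guarantee, which is a common reason to prefer it; whether the paper's objective satisfies that property is not stated in the quoted material.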
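
The quoted simulation parameters can be made concrete with a short sketch. Only k_p = 15 and the i.i.d. Beta draws come from the quoted text; the specific Beta parameters, the value of ϵ, and the reading of the human prior as a constant per-region error rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

K_P = 15                 # number of regions, k_p = 15 (from the paper)
ALPHA, BETA = 2.0, 8.0   # Beta(α_ai, β_ai) parameters -- placeholder values
EPSILON = 0.3            # human prior error constant ϵ -- placeholder value

# Per-region AI error rates: err_p[i] ~ Beta(ALPHA, BETA), drawn i.i.d.
err_p = rng.beta(ALPHA, BETA, size=K_P)

# One reading of the human prior: the human errs with constant
# probability EPSILON in every region.
human_err = np.full(K_P, EPSILON)

# Under these models, the human should defer to the AI exactly in the
# regions where the AI's error rate is below their own.
defer_to_ai = err_p < human_err

def simulate_query(region: int) -> bool:
    """Return True if the human-AI team answers correctly, assuming the
    human defers according to `defer_to_ai`."""
    p_err = err_p[region] if defer_to_ai[region] else human_err[region]
    return rng.random() >= p_err
```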