Teaching Humans When to Defer to a Classifier via Exemplars
Authors: Hussein Mozannar, Arvind Satyanarayan, David Sontag | pp. 5323-5331
AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | For our main evaluation, we conduct experiments on Amazon Mechanical Turk on the task of passage-based question answering from HotpotQA (Yang et al. 2018). Crowdworkers first performed a teaching phase and were then tested on a randomly chosen subset of examples. Our results demonstrate the importance of teaching: around half of the participants who undertook the teaching phase were able to correctly determine the AI's region of error and had a resulting improved performance. We furthermore validate our method on a set of synthetic experiments. |
| Researcher Affiliation | Academia | Hussein Mozannar, Arvind Satyanarayan, David Sontag Massachusetts Institute of Technology mozannar@mit.edu |
| Pseudocode | No | The paper describes a greedy algorithm (GREEDY-SELECT) with steps and equations, but it is presented within the main text rather than in a clearly structured pseudocode or algorithm block. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | We rely on the Hotpot QA dataset (Yang et al. 2018) collected by crowdsourcing based on Wikipedia articles. ... To complement our NLP-based experiments, we run a study on CIFAR-10 (Krizhevsky, Hinton et al. 2009) consisting of images from 10 classes. |
| Dataset Splits | Yes | We further remove yes/no questions from the dataset and only consider hard multi-hop questions from the train set of 14,631 examples and the dev set of 6,947 examples. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, memory) used to run its experiments. |
| Software Dependencies | No | The paper mentions software components like 'RoBERTa embeddings', 'K-means', 'Sentence BERT model', and 'LIME', but it does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | No | The paper describes parameters for its simulated AI and human models (e.g., 'k_p = 15 and a vector of errors err_p where for each i, err_p[i] is drawn i.i.d. from Beta(α_ai, β_ai)', 'human prior thresholds the probability error of the human to a constant ϵ'), as well as the overall experimental design. However, it does not provide specific training hyperparameters such as learning rates, batch sizes, or number of epochs for any of the models used (e.g., the Wide ResNet). |
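The GREEDY-SELECT procedure flagged above is described in the paper only in prose. As a rough illustration of the general idea (not the authors' implementation), greedy exemplar selection repeatedly adds the candidate with the largest marginal gain under some surrogate teaching objective; the coverage objective, the error region, and all names below are hypothetical stand-ins:

```python
import numpy as np

def greedy_select(candidates, k, gain_fn):
    """Greedily pick k exemplar indices, each step adding the candidate
    with the largest marginal gain under gain_fn (a hypothetical
    surrogate for a teaching objective, not the paper's exact one)."""
    selected = []
    for _ in range(k):
        remaining = [i for i in candidates if i not in selected]
        best = max(remaining,
                   key=lambda i: gain_fn(selected + [i]) - gain_fn(selected))
        selected.append(best)
    return selected

# Toy surrogate: an exemplar "teaches" the AI-error points near it, so the
# gain is the number of error points within radius r of any chosen exemplar.
rng = np.random.default_rng(0)
points = rng.uniform(0.0, 1.0, size=(50, 2))
ai_errors = points[:, 0] > 0.5  # assumed region where the simulated AI errs

def coverage(sel, r=0.2):
    if not sel:
        return 0
    # Distance from every point to its nearest selected exemplar.
    d = np.linalg.norm(points[:, None, :] - points[sel][None, :, :], axis=-1)
    covered = (d.min(axis=1) <= r) & ai_errors
    return int(covered.sum())

exemplars = greedy_select(range(len(points)), k=5, gain_fn=coverage)
print(exemplars, coverage(exemplars))
```

Because the coverage objective is monotone, each greedy step can only grow the set of covered error points, which mirrors why greedy selection is a natural fit for teaching-set construction.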