Teaching Humans When to Defer to a Classifier via Exemplars

Authors: Hussein Mozannar, Arvind Satyanarayan, David Sontag

AAAI 2022, pp. 5323-5331

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | For our main evaluation, we conduct experiments on Amazon Mechanical Turk on the task of passage-based question answering from HotpotQA (Yang et al. 2018). Crowdworkers first performed a teaching phase and were then tested on a randomly chosen subset of examples. Our results demonstrate the importance of teaching: around half of the participants who undertook the teaching phase were able to correctly determine the AI's region of error and had a resulting improved performance. We furthermore validate our method on a set of synthetic experiments.
Researcher Affiliation | Academia | Hussein Mozannar, Arvind Satyanarayan, David Sontag; Massachusetts Institute of Technology; mozannar@mit.edu
Pseudocode | No | The paper describes a greedy algorithm (GREEDY-SELECT) with steps and equations, but it is presented within the main text rather than in a clearly structured pseudocode or algorithm block. (A hedged sketch of the greedy pattern appears after this table.)
Open Source Code | No | The paper does not provide an explicit statement about releasing its source code, nor a link to a code repository for the methodology described.
Open Datasets | Yes | We rely on the HotpotQA dataset (Yang et al. 2018), collected by crowdsourcing based on Wikipedia articles. ... To complement our NLP-based experiments, we run a study on CIFAR-10 (Krizhevsky, Hinton et al. 2009), consisting of images from 10 classes.
Dataset Splits | Yes | We further remove yes/no questions from the dataset and only consider hard multi-hop questions from the train set of 14631 examples and the dev set of 6947 examples.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, memory) used to run its experiments.
Software Dependencies | No | The paper mentions software components such as RoBERTa embeddings, K-means, a Sentence-BERT model, and LIME, but it does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | No | The paper describes parameters for its simulated AI and human models (e.g., k_p = 15 and a vector of errors err_p where, for each i, err_p[i] is drawn i.i.d. from Beta(α_ai, β_ai); the human prior thresholds the human's probability of error at a constant ϵ), as well as the overall experimental design. However, it does not provide specific training hyperparameters such as learning rates, batch sizes, or number of epochs for any of the models used (e.g., the Wide ResNet). (A hedged simulation sketch appears after this table.)
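
Since GREEDY-SELECT is given only in prose, the following Python sketch shows the generic greedy subset-selection pattern that the description suggests. The function name, the marginal-gain callable `gain`, and the default budget of k = 15 exemplars are illustrative assumptions, not the paper's actual objective, which models how each exemplar shifts the learner's estimate of the AI's error region.

```python
import numpy as np

def greedy_select(candidates, gain, k=15):
    """Greedy exemplar selection: repeatedly add the candidate with the
    largest marginal gain. `gain` maps a list of selected exemplars to a
    score; it is a hypothetical stand-in for the paper's teaching
    objective, which is not reproduced in the quoted material."""
    selected = []
    remaining = list(candidates)
    for _ in range(min(k, len(remaining))):
        # Marginal gain of each remaining candidate given the current set.
        gains = [gain(selected + [c]) - gain(selected) for c in remaining]
        best = int(np.argmax(gains))
        selected.append(remaining.pop(best))
    return selected
```

If the objective is monotone submodular, this greedy pattern carries the classic (1 - 1/e) approximation guarantee, which is a common reason to prefer it; whether the paper's objective satisfies that property is not stated in the quoted material.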
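
The quoted simulation parameters can be made concrete with a short sketch. Only k_p = 15 and the i.i.d. Beta draws come from the quoted text; the specific Beta parameters, the value of ϵ, and the reading of the human prior as a constant per-region error rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

K_P = 15                 # number of regions, k_p = 15 (from the paper)
ALPHA, BETA = 2.0, 8.0   # Beta(α_ai, β_ai) parameters -- placeholder values
EPSILON = 0.3            # human prior error constant ϵ -- placeholder value

# Per-region AI error rates: err_p[i] ~ Beta(ALPHA, BETA), drawn i.i.d.
err_p = rng.beta(ALPHA, BETA, size=K_P)

# One reading of the human prior: the human errs with constant
# probability EPSILON in every region.
human_err = np.full(K_P, EPSILON)

# Under these models, the human should defer to the AI exactly in the
# regions where the AI's error rate is below their own.
defer_to_ai = err_p < human_err

def simulate_query(region: int) -> bool:
    """Return True if the human-AI team answers correctly, assuming the
    human defers according to `defer_to_ai`."""
    p_err = err_p[region] if defer_to_ai[region] else human_err[region]
    return rng.random() >= p_err
```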