Consistent Estimators for Learning to Defer to an Expert

Authors: Hussein Mozannar, David Sontag

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show the effectiveness of our approach on a variety of experimental tasks. We provide a detailed experimental evaluation of our method and baselines from the literature on image and text classification tasks.
Researcher Affiliation | Academia | CSAIL, Massachusetts Institute of Technology, Cambridge, MA, USA.
Pseudocode | No | The paper does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps formatted like code.
Open Source Code | Yes | We provide code to reproduce our experiments. (Footnote 1: https://github.com/clinicalml/learn-to-defer)
Open Datasets | Yes | We give experimental evidence on image classification datasets CIFAR-10 and CIFAR-100 using synthetic and human experts based on CIFAR10H (Peterson et al., 2019), on a hate speech and offensive language detection task (Davidson et al., 2017), and on classification of chest X-rays with synthetic experts... CIFAR-10 image classification dataset (Krizhevsky et al., 2009)... we use the dataset CIFAR10H (Peterson et al., 2019)... CheXpert is a large chest radiograph dataset... (Irvin et al., 2019)... dataset created by Davidson et al. (2017).
Dataset Splits | Yes | CIFAR-10... split into 50,000 train and 10,000 test images. We randomly split the test set in half, where one half constitutes S_l and the other is for testing; we randomize the splitting over 10 trials. We use the downsampled-resolution version of CheXpert (Irvin et al., 2019) and split the training dataset with an 80-10-10 split on a patient basis for training, validation and testing, respectively; no patients are shared among the splits. We randomly split the dataset with a 60-10-30% split into a training, validation and test set, respectively; we repeat the experiments for 5 random splits.
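The patient-level 80-10-10 split quoted above (no patient shared across splits) can be sketched as below; the function name, record structure, and "patient" field are illustrative assumptions, not taken from the paper's code.

```python
import random
from collections import defaultdict

def patient_level_split(records, seed=0, fracs=(0.8, 0.1, 0.1)):
    """Split records 80-10-10 by patient ID, so that no patient
    appears in more than one of train/val/test (as in the quoted
    CheXpert protocol). Each record is assumed to carry a 'patient' key."""
    patients = sorted({r["patient"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n = len(patients)
    n_train = int(fracs[0] * n)
    n_val = int(fracs[1] * n)
    # Assign every patient (and hence all of their records) to one split.
    assignment = {}
    for i, p in enumerate(patients):
        if i < n_train:
            assignment[p] = "train"
        elif i < n_train + n_val:
            assignment[p] = "val"
        else:
            assignment[p] = "test"
    splits = defaultdict(list)
    for r in records:
        splits[assignment[r["patient"]]].append(r)
    return splits["train"], splits["val"], splits["test"]
```

Splitting by patient rather than by image avoids leakage, since multiple radiographs of the same patient are highly correlated.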
Hardware Specification | No | The paper mentions using Wide Residual Networks and DenseNet121 architectures, but does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running experiments.
Software Dependencies | No | The paper mentions using optimizers like SGD with momentum and Adam, and model architectures like WideResNets and DenseNet121, but does not provide specific version numbers for software libraries, programming languages (e.g., Python), or other dependencies.
Experiment Setup | Yes | We use SGD with momentum and a cosine annealing learning rate schedule. We train the baseline models using Adam for 4 epochs. For our approach we train for 3 epochs using the cross-entropy loss and then train for one epoch using L_CE^α, with α chosen to maximize the area under the receiver operating characteristic curve (AUROC) of the combined system on the validation set for each of the 5 tasks (each task is treated separately). We used a grid search over the validation set to find α. For our model we use the CNN developed in (Kim, 2014) for text classification with 100-dimensional GloVe embeddings (Pennington et al., 2014) and 300 filters of sizes {3, 4, 5} using dropout.
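Two pieces of the quoted recipe, the cosine-annealed learning rate and the validation grid search over α, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the `train_and_eval` callable are assumptions, and the real grid values and AUROC computation are not given here.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    # Cosine annealing: the learning rate decays smoothly from lr_max
    # at step 0 to lr_min at the final step.
    return lr_min + 0.5 * (lr_max - lr_min) * (
        1 + math.cos(math.pi * step / total_steps)
    )

def grid_search_alpha(alphas, train_and_eval):
    # train_and_eval(alpha) is assumed to train with the alpha-weighted
    # loss and return the validation AUROC of the combined
    # classifier+expert system; we keep the alpha with the best score.
    best_alpha, best_score = None, float("-inf")
    for alpha in alphas:
        score = train_and_eval(alpha)
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha, best_score
```

Because α is selected on a held-out validation set per task, the same grid can be reused across the 5 tasks while still treating each task separately, as the quote describes.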