Consistent Estimators for Learning to Defer to an Expert

Authors: Hussein Mozannar, David Sontag

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show the effectiveness of our approach on a variety of experimental tasks. We provide a detailed experimental evaluation of our method and baselines from the literature on image and text classification tasks.
Researcher Affiliation | Academia | CSAIL, Massachusetts Institute of Technology, Cambridge, MA, USA.
Pseudocode | No | The paper does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps formatted like code.
Open Source Code | Yes | We provide code to reproduce our experiments. (Footnote 1: https://github.com/clinicalml/learn-to-defer)
Open Datasets | Yes | We give experimental evidence on image classification datasets CIFAR-10 and CIFAR-100 using synthetic and human experts based on CIFAR10H (Peterson et al., 2019), on a hate speech and offensive language detection task (Davidson et al., 2017), and on classification of chest X-rays with synthetic experts... CIFAR-10 image classification dataset (Krizhevsky et al., 2009)... we use the dataset CIFAR10H (Peterson et al., 2019)... CheXpert is a large chest radiograph dataset... (Irvin et al., 2019)... dataset created by Davidson et al. (2017).
Dataset Splits | Yes | CIFAR-10... split into 50,000 train and 10,000 test images. We randomly split the test set in half, where one half constitutes S_l and the other is for testing; we randomize the splitting over 10 trials. We use the downsampled-resolution version of CheXpert (Irvin et al., 2019) and split the training dataset with an 80-10-10 split on a patient basis for training, validation and testing, respectively; no patients are shared among the splits. We randomly split the dataset with a 60-10-30% split into a training, validation and test set, respectively; we repeat the experiments for 5 random splits.
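The patient-level 80-10-10 split quoted above (no patient shared across splits) can be sketched as below; the function name, record structure, and "patient" field are illustrative assumptions, not taken from the paper's code.

```python
import random
from collections import defaultdict

def patient_level_split(records, seed=0, fracs=(0.8, 0.1, 0.1)):
    """Split records 80-10-10 by patient ID, so that no patient
    appears in more than one of train/val/test (as in the quoted
    CheXpert protocol). Each record is assumed to carry a 'patient' key."""
    patients = sorted({r["patient"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n = len(patients)
    n_train = int(fracs[0] * n)
    n_val = int(fracs[1] * n)
    # Assign every patient (and hence all of their records) to one split.
    assignment = {}
    for i, p in enumerate(patients):
        if i < n_train:
            assignment[p] = "train"
        elif i < n_train + n_val:
            assignment[p] = "val"
        else:
            assignment[p] = "test"
    splits = defaultdict(list)
    for r in records:
        splits[assignment[r["patient"]]].append(r)
    return splits["train"], splits["val"], splits["test"]
```

Splitting by patient rather than by image avoids leakage, since multiple radiographs of the same patient are highly correlated.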
Hardware Specification | No | The paper mentions using Wide Residual Networks and DenseNet121 architectures, but does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running experiments.
Software Dependencies | No | The paper mentions using optimizers like SGD with momentum and Adam, and model architectures like WideResNets and DenseNet121, but does not provide specific version numbers for software libraries, programming languages (e.g., Python), or other dependencies.
Experiment Setup | Yes | We use SGD with momentum and a cosine annealing learning rate schedule. We train the baseline models using Adam for 4 epochs. For our approach we train for 3 epochs using the cross-entropy loss and then train for one epoch using L_CE^α, with α chosen to maximize the area under the receiver operating characteristic curve (AUROC) of the combined system on the validation set for each of the 5 tasks (each task is treated separately). We used a grid search over the validation set to find α. For our model we use the CNN developed in (Kim, 2014) for text classification with 100-dimensional GloVe embeddings (Pennington et al., 2014) and 300 filters of sizes {3, 4, 5} using dropout.
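Two pieces of the quoted recipe, the cosine-annealed learning rate and the validation grid search over α, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the `train_and_eval` callable are assumptions, and the real grid values and AUROC computation are not given here.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    # Cosine annealing: the learning rate decays smoothly from lr_max
    # at step 0 to lr_min at the final step.
    return lr_min + 0.5 * (lr_max - lr_min) * (
        1 + math.cos(math.pi * step / total_steps)
    )

def grid_search_alpha(alphas, train_and_eval):
    # train_and_eval(alpha) is assumed to train with the alpha-weighted
    # loss and return the validation AUROC of the combined
    # classifier+expert system; we keep the alpha with the best score.
    best_alpha, best_score = None, float("-inf")
    for alpha in alphas:
        score = train_and_eval(alpha)
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha, best_score
```

Because α is selected on a held-out validation set per task, the same grid can be reused across the 5 tasks while still treating each task separately, as the quote describes.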