Combining Human Predictions with Model Probabilities via Confusion Matrices and Calibration

Authors: Gavin Kerrigan, Padhraic Smyth, Mark Steyvers

Venue: NeurIPS 2021

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results on image classification with CIFAR-10 and a subset of ImageNet demonstrate that such human-model combinations consistently have higher accuracies than the model or human alone, and that the parameters of the combination method can be estimated effectively with as few as ten labeled datapoints. (Section 5, Experiments; a hedged sketch of the combination rule appears below the table.) |
| Researcher Affiliation | Academia | Gavin Kerrigan (1), Padhraic Smyth (1), Mark Steyvers (2). (1) Department of Computer Science, (2) Department of Cognitive Sciences, University of California, Irvine. gavin.k@uci.edu, smyth@ics.uci.edu, mark.steyvers@uci.edu |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Code for our estimation methods and experiments is available at: https://github.com/GavinKerrigan/conf_matrix_and_calibration |
| Open Datasets | Yes | We evaluate various combination strategies on two pre-existing image classification datasets that include human annotations: CIFAR-10H [Peterson et al., 2019] and ImageNet-16H [Steyvers et al., 2022]. The ImageNet-16H dataset is available on the Open Science Framework at https://osf.io/2ntrf/. |
| Dataset Splits | Yes | Both datasets are partitioned into three disjoint subsets: (i) a model training set D_T, (ii) a combination training set D_C, and (iii) an evaluation set D_E. The combination training set and evaluation set are subsets of the original test sets, where 70% of the data is used for fitting the combinations and 30% is used for evaluation. (A minimal split sketch appears below the table.) |
| Hardware Specification | No | The paper describes the models and datasets used but does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions several models and calibration methods, and notes 'PyTorch 1.9.0' in Appendix F, but it does not list the key software components with version numbers needed to fully reproduce the environment. |
| Experiment Setup | Yes | In Appendix F, we describe our model architectures and training procedures in detail. All models were implemented with PyTorch 1.9.0 and trained using the Adam optimizer with a batch size of 128. (A skeletal training loop with these settings appears below the table.) |
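
For orientation, the combination the title refers to treats the human's label as evidence whose likelihood is given by the human's confusion matrix, and fuses it with a calibrated model probability vector via Bayes' rule. Below is a minimal NumPy sketch of that idea, not the authors' implementation (which lives in the linked repository); the function names, the Laplace pseudo-count, and the choice of temperature scaling as the calibration method are illustrative assumptions.

```python
import numpy as np

def fit_confusion_matrix(human_labels, true_labels, n_classes, alpha=1.0):
    """Estimate P(human label = h | true label = y) from a small labeled set.

    A Dirichlet/Laplace pseudo-count `alpha` keeps the estimate usable even
    with very few datapoints (the paper reports as few as ten can suffice).
    """
    counts = np.full((n_classes, n_classes), alpha)
    for h, y in zip(human_labels, true_labels):
        counts[y, h] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)  # row y holds P(h | y)

def calibrate(logits, temperature):
    """Temperature-scaled softmax: one common calibration choice."""
    z = logits / temperature
    z -= z.max()  # numerical stability
    p = np.exp(z)
    return p / p.sum()

def combine(model_logits, human_label, conf_mat, temperature=1.0):
    """Bayes-rule combination: p(y | m, h) is proportional to p_cal(y | m) * P(h | y)."""
    p_model = calibrate(model_logits, temperature)
    likelihood = conf_mat[:, human_label]  # P(h | y) for every candidate y
    posterior = p_model * likelihood
    return posterior / posterior.sum()
```

With ten classes, `combine(logits, human_label, conf_mat, temperature)` returns a length-10 posterior whose argmax is the combined prediction.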
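The Dataset Splits row describes a 70/30 partition of the original test set into the combination training set D_C and the evaluation set D_E. A one-function sketch of that partition, assuming a simple random shuffle (the quoted text does not specify seeding or stratification):

```python
import numpy as np

def split_combination_eval(n_items, frac_combination=0.7, seed=0):
    """Partition test-set indices into D_C (fit combinations) and D_E (evaluate).

    The 70/30 proportion follows the paper; the seed is an assumption.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_items)
    n_c = int(frac_combination * n_items)
    return idx[:n_c], idx[n_c:]  # (D_C indices, D_E indices)
```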
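Finally, the Experiment Setup row pins down three concrete settings: PyTorch 1.9.0, the Adam optimizer, and a batch size of 128. A skeletal training loop using exactly those settings; the model, dataset, learning rate, and epoch count are placeholders, since the remaining details are in the paper's Appendix F.

```python
import torch
from torch.utils.data import DataLoader

def train(model, train_dataset, epochs, lr=1e-3, device="cuda"):
    """Skeleton matching the reported setup: Adam optimizer, batch size 128.

    The learning rate and epoch count are placeholders; the paper's
    Appendix F gives the full per-model training procedure.
    """
    loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.to(device).train()
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
    return model
```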