Combining Human Predictions with Model Probabilities via Confusion Matrices and Calibration

Authors: Gavin Kerrigan, Padhraic Smyth, Mark Steyvers

Venue: NeurIPS 2021

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results on image classification with CIFAR-10 and a subset of ImageNet demonstrate that such human-model combinations consistently have higher accuracies than the model or human alone, and that the parameters of the combination method can be estimated effectively with as few as ten labeled datapoints. (Section 5, Experiments; a hedged sketch of the combination rule appears below the table.) |
| Researcher Affiliation | Academia | Gavin Kerrigan (1), Padhraic Smyth (1), Mark Steyvers (2). (1) Department of Computer Science, (2) Department of Cognitive Sciences, University of California, Irvine. gavin.k@uci.edu, smyth@ics.uci.edu, mark.steyvers@uci.edu |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Code for our estimation methods and experiments is available at: https://github.com/GavinKerrigan/conf_matrix_and_calibration |
| Open Datasets | Yes | We evaluate various combination strategies on two pre-existing image classification datasets that include human annotations: CIFAR-10H [Peterson et al., 2019] and ImageNet-16H [Steyvers et al., 2022]. The ImageNet-16H dataset is available on the Open Science Framework at https://osf.io/2ntrf/. |
| Dataset Splits | Yes | Both datasets are partitioned into three disjoint subsets: (i) a model training set D_T, (ii) a combination training set D_C, and (iii) an evaluation set D_E. The combination training set and evaluation set are subsets of the original test sets, where 70% of the data is used for fitting the combinations and 30% is used for evaluation. (A minimal split sketch appears below the table.) |
| Hardware Specification | No | The paper describes the models and datasets used but does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions several models and calibration methods, and notes 'PyTorch 1.9.0' in Appendix F, but it does not list the key software components with version numbers needed to fully reproduce the environment. |
| Experiment Setup | Yes | In Appendix F, we describe our model architectures and training procedures in detail. All models were implemented with PyTorch 1.9.0 and trained using the Adam optimizer with a batch size of 128. (A skeletal training loop with these settings appears below the table.) |
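
For orientation, the combination the title refers to treats the human's label as evidence whose likelihood is given by the human's confusion matrix, and fuses it with a calibrated model probability vector via Bayes' rule. Below is a minimal NumPy sketch of that idea, not the authors' implementation (which lives in the linked repository); the function names, the Laplace pseudo-count, and the choice of temperature scaling as the calibration method are illustrative assumptions.

```python
import numpy as np

def fit_confusion_matrix(human_labels, true_labels, n_classes, alpha=1.0):
    """Estimate P(human label = h | true label = y) from a small labeled set.

    A Dirichlet/Laplace pseudo-count `alpha` keeps the estimate usable even
    with very few datapoints (the paper reports as few as ten can suffice).
    """
    counts = np.full((n_classes, n_classes), alpha)
    for h, y in zip(human_labels, true_labels):
        counts[y, h] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)  # row y holds P(h | y)

def calibrate(logits, temperature):
    """Temperature-scaled softmax: one common calibration choice."""
    z = logits / temperature
    z -= z.max()  # numerical stability
    p = np.exp(z)
    return p / p.sum()

def combine(model_logits, human_label, conf_mat, temperature=1.0):
    """Bayes-rule combination: p(y | m, h) is proportional to p_cal(y | m) * P(h | y)."""
    p_model = calibrate(model_logits, temperature)
    likelihood = conf_mat[:, human_label]  # P(h | y) for every candidate y
    posterior = p_model * likelihood
    return posterior / posterior.sum()
```

With ten classes, `combine(logits, human_label, conf_mat, temperature)` returns a length-10 posterior whose argmax is the combined prediction.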
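The Dataset Splits row describes a 70/30 partition of the original test set into the combination training set D_C and the evaluation set D_E. A one-function sketch of that partition, assuming a simple random shuffle (the quoted text does not specify seeding or stratification):

```python
import numpy as np

def split_combination_eval(n_items, frac_combination=0.7, seed=0):
    """Partition test-set indices into D_C (fit combinations) and D_E (evaluate).

    The 70/30 proportion follows the paper; the seed is an assumption.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_items)
    n_c = int(frac_combination * n_items)
    return idx[:n_c], idx[n_c:]  # (D_C indices, D_E indices)
```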
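Finally, the Experiment Setup row pins down three concrete settings: PyTorch 1.9.0, the Adam optimizer, and a batch size of 128. A skeletal training loop using exactly those settings; the model, dataset, learning rate, and epoch count are placeholders, since the remaining details are in the paper's Appendix F.

```python
import torch
from torch.utils.data import DataLoader

def train(model, train_dataset, epochs, lr=1e-3, device="cuda"):
    """Skeleton matching the reported setup: Adam optimizer, batch size 128.

    The learning rate and epoch count are placeholders; the paper's
    Appendix F gives the full per-model training procedure.
    """
    loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.to(device).train()
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
    return model
```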