Combining Human Predictions with Model Probabilities via Confusion Matrices and Calibration
Authors: Gavin Kerrigan, Padhraic Smyth, Mark Steyvers
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results on image classification with CIFAR-10 and a subset of ImageNet demonstrate that such human-model combinations consistently have higher accuracies than the model or human alone, and that the parameters of the combination method can be estimated effectively with as few as ten labeled datapoints. (Section 5, Experiments) |
| Researcher Affiliation | Academia | Gavin Kerrigan¹, Padhraic Smyth¹, Mark Steyvers²; ¹Department of Computer Science, ²Department of Cognitive Sciences, University of California, Irvine. gavin.k@uci.edu, smyth@ics.uci.edu, mark.steyvers@uci.edu |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Code for our estimation methods and experiments is available at: https://github.com/GavinKerrigan/conf_matrix_and_calibration. |
| Open Datasets | Yes | We evaluate various combination strategies on two pre-existing image classification datasets that include human annotations: CIFAR-10H [Peterson et al., 2019] and ImageNet-16H [Steyvers et al., 2022]. The ImageNet-16H dataset is available on the Open Science Foundation at https://osf.io/2ntrf/. |
| Dataset Splits | Yes | Both datasets are partitioned into three disjoint subsets: (i) a model training set DT , (ii) a combination training set DC, and (iii) an evaluation set DE. The combination training set and evaluation set are subsets of the original test sets, where 70% of the data is used for fitting the combinations and 30% is used for evaluation. |
| Hardware Specification | No | The paper describes the models and datasets used but does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions 'PyTorch 1.9.0' in Appendix F, but does not list the remaining key software components with specific version numbers, so the environment cannot be fully reproduced from the paper alone. |
| Experiment Setup | Yes | In Appendix F, we describe our model architectures and training procedures in detail. All models were implemented with PyTorch 1.9.0 and trained using the Adam optimizer with a batch size of 128. |
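The table above describes a method that combines a model's class probabilities with a human label via a human confusion matrix, whose parameters can be estimated from as few as ten labeled datapoints. The sketch below illustrates one plausible form of such a combination under a conditional-independence assumption; the function names, the Laplace smoothing, and all numbers are illustrative assumptions, not taken from the paper's released code.

```python
import numpy as np

def estimate_confusion(human_labels, true_labels, n_classes, smoothing=1.0):
    """Estimate P(human says h | true class y) from a small labeled set.

    Laplace smoothing keeps every rate nonzero, so even a handful of
    labeled datapoints yields a usable (if crude) confusion matrix.
    """
    counts = np.full((n_classes, n_classes), smoothing)
    for y, h in zip(true_labels, human_labels):
        counts[y, h] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

def combine(model_probs, human_label, confusion):
    """Combine calibrated model probabilities with one human label via Bayes,
    assuming human and model errors are conditionally independent given y."""
    posterior = model_probs * confusion[:, human_label]  # P(y|x) * P(h|y)
    return posterior / posterior.sum()

# Toy 3-class example with ten labeled pairs (illustrative numbers only).
confusion = estimate_confusion(
    human_labels=[0, 0, 1, 1, 2, 2, 1, 0, 2, 2],
    true_labels=[0, 0, 1, 1, 2, 2, 0, 1, 2, 2],
    n_classes=3,
)
model_probs = np.array([0.6, 0.3, 0.1])  # calibrated model output for one image
combined = combine(model_probs, human_label=1, confusion=confusion)
```

In practice the confusion matrix would be fit on the 70% combination training split described above and the combined predictions scored on the held-out 30% evaluation split.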