Predictive Multiplicity in Probabilistic Classification

Authors: Jamelle Watson-Daniels, David C. Parkes, Berk Ustun

AAAI 2023

Reproducibility assessment. Each item lists the variable, the result, and the supporting excerpt or rationale from the LLM run.
Research Type: Experimental
Evidence: "We present an empirical study on seven real-world risk assessment tasks. We show that probabilistic classification tasks can in fact admit competing models that assign substantially different risk estimates. Our results also demonstrate how multiplicity can disproportionately impact marginalized individuals." "In this section, we present experiments on synthetic and real-world data. Our goals are to: (1) reveal dataset characteristics that impact predictive multiplicity; and (2) determine the extent to which real risk assessment tasks exhibit predictive multiplicity in practice. Our results are shown in Figure 4, and additional results are in the Appendix."
Researcher Affiliation: Collaboration
Evidence: "Jamelle Watson-Daniels (Harvard University), David C. Parkes (Harvard University, DeepMind), Berk Ustun (UC San Diego); jwatsondaniels@g.harvard.edu, parkes@eecs.harvard.edu, berk@ucsd.edu"
Pseudocode: No
Evidence: The paper describes computational procedures, such as the outer-approximation algorithm, in prose but does not provide structured pseudocode or algorithm blocks.
Open Source Code: No
Evidence: The paper provides no statement or link indicating that source code for its methodology is openly available.
Open Datasets: Yes
Evidence: "Altogether, we consider seven datasets that exhibit variations in sample size, number of features, and class imbalance (see Table 1 in the Appendix). For each dataset, we compute viable prediction ranges, ambiguity and discrepancy using the methods outlined in Section 3. When training candidate models, we adopt a grid of target predictions: P = {0.01, 0.1, 0.2, ..., 0.9, 0.99}. We compute discrepancy by solving the MINLP Eq. (7) with CPLEX v20.1 (Diamond and Boyd 2016) on a single CPU with 16GB RAM. Our results are shown in Figure 4, and additional results are in the Appendix." The dataset names mentioned include mammo (breast cancer), apnea (sleep apnea), and arrest (crime rearrest).
Dataset Splits: No
Evidence: The paper mentions training models and evaluating on test data, but does not specify training/validation/test splits (no percentages, counts, or references to standard splits for all three).
Hardware Specification: Yes
Evidence: "We compute discrepancy by solving the MINLP Eq. (7) with CPLEX v20.1 (Diamond and Boyd 2016) on a single CPU with 16GB RAM."
Software Dependencies: Yes
Evidence: "We compute discrepancy by solving the MINLP Eq. (7) with CPLEX v20.1 (Diamond and Boyd 2016) on a single CPU with 16GB RAM."
Experiment Setup: Yes
Evidence: "When training candidate models, we adopt a grid of target predictions: P = {0.01, 0.1, 0.2, ..., 0.9, 0.99}."
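The target-prediction grid and the viable prediction ranges quoted above can be illustrated with a short sketch. This is an assumption-laden reading, not the authors' code: the paper additionally constrains candidate models to near-optimal loss (a condition not modeled here), and `viable_prediction_range` is a hypothetical helper name.

```python
# Target-prediction grid as stated in the paper's setup:
# P = {0.01, 0.1, 0.2, ..., 0.9, 0.99}
TARGET_GRID = [0.01] + [round(0.1 * k, 2) for k in range(1, 10)] + [0.99]

def viable_prediction_range(candidate_preds):
    """Given a list of per-model prediction lists (one risk estimate
    per individual from each competing candidate model), return
    (lo, hi): each individual's minimum and maximum assigned risk,
    i.e. an illustrative viable prediction range.

    Illustrative only: the paper restricts candidates to models whose
    loss is near-optimal, a constraint this sketch does not enforce.
    """
    per_individual = list(zip(*candidate_preds))  # transpose: one tuple per individual
    lo = [min(col) for col in per_individual]
    hi = [max(col) for col in per_individual]
    return lo, hi

# Toy example: 3 competing models, 4 individuals.
preds = [
    [0.10, 0.40, 0.55, 0.90],
    [0.12, 0.35, 0.70, 0.88],
    [0.09, 0.50, 0.60, 0.91],
]
lo, hi = viable_prediction_range(preds)
widths = [h - l for l, h in zip(lo, hi)]  # range width per individual
```

A wide range width flags an individual whose risk estimate is unstable across competing models, which is the multiplicity the assessed paper measures.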