Predictive Multiplicity in Classification
Authors: Charles Marx, Flavio Calmon, Berk Ustun
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we apply our tools to measure predictive multiplicity in recidivism prediction problems. We have three goals: (i) to measure the incidence of predictive multiplicity in real-world classification problems; (ii) to discuss how reporting predictive multiplicity can inform stakeholders; (iii) to show that we can also measure predictive multiplicity using existing tools, albeit imperfectly. ... Datasets. We derive 8 datasets from the following studies of recidivism in the United States ... In Figure 3, we plot ambiguity and discrepancy for all possible values of the error tolerance ε for compas arrest, comparing the measures produced using our tools to those produced using an ad hoc analysis. In Table 2, we compare competing classifiers for compas arrest. |
| Researcher Affiliation | Academia | 1Haverford College 2Harvard SEAS 3UC San Diego. Correspondence to: Charles T. Marx <cmarx@haverford.edu>, Berk Ustun <berk@ucsd.edu>. |
| Pseudocode | Yes | Algorithm 1 Compute Discrepancy for All Values of ε and Algorithm 2 Compute Ambiguity for All Values of ε |
| Open Source Code | Yes | We include software to reproduce our results at https://github.com/charliemarx/pmtools. |
| Open Datasets | Yes | Datasets. We derive 8 datasets from the following studies of recidivism in the United States: compas from Angwin et al. 2016; pretrial from Felony Defendants in Large Urban Counties (US Dept. of Justice, 2014b); recidivism from Recidivism of Prisoners Released in 1994 (US Dept. of Justice, 2014a). ... All datasets are publicly available. We include a copy of compas arrest and compas violent with our code. The remaining datasets must be requested from ICPSR due to privacy restrictions. |
| Dataset Splits | No | No explicit validation split is mentioned for the primary experimental setup. The paper states: 'We split each dataset into a training set composed of 80% of points and a test set composed of 20% of points.' |
| Hardware Specification | Yes | We solve each MIP on a 3.33 GHz CPU with 16 GB RAM. |
| Software Dependencies | No | No version numbers are provided for software dependencies. The paper mentions a 'MIP solver such as CPLEX, CBC, or Gurobi' and the 'glmnet package of Friedman et al. (2010)' without version details. |
| Experiment Setup | Yes | We use a baseline classifier h0 that minimizes the error rate, which we fit using a MIP formulation in Appendix B. ... We allocate at most 6 hours to fit the baseline model, 6 hours to fit the models to compute discrepancy for all ε, and 6 hours to fit the models to compute ambiguity for all ε. ... a margin parameter γ > 0, which should be set to a small positive number (e.g., γ = 10⁻⁴). |
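The two multiplicity measures quoted above can be illustrated with a short sketch. Assuming we already have 0/1 predictions from a baseline model and from a set of competing models in the ε-level set (obtaining those models is the paper's MIP-based step, not shown here), discrepancy is the largest fraction of points on which any single competing model disagrees with the baseline, and ambiguity is the fraction of points flipped by at least one competing model. The function names and toy data below are illustrative, not from the paper's released code.

```python
import numpy as np

def discrepancy(baseline_preds, competing_preds):
    """Max fraction of points on which any one competing model
    disagrees with the baseline classifier."""
    baseline = np.asarray(baseline_preds)
    return max(np.mean(np.asarray(p) != baseline) for p in competing_preds)

def ambiguity(baseline_preds, competing_preds):
    """Fraction of points whose prediction is changed by at least
    one competing model in the set."""
    baseline = np.asarray(baseline_preds)
    flipped = np.zeros(len(baseline), dtype=bool)
    for p in competing_preds:
        flipped |= (np.asarray(p) != baseline)  # mark any point some model flips
    return flipped.mean()

# Toy example: 5 points, baseline plus two competing classifiers.
base = [0, 1, 1, 0, 1]
models = [[0, 1, 0, 0, 1],   # disagrees on point 2
          [1, 1, 1, 0, 1]]   # disagrees on point 0

print(discrepancy(base, models))  # 0.2 (each competitor flips 1 of 5 points)
print(ambiguity(base, models))    # 0.4 (points 0 and 2 flipped by some model)
```

Note that ambiguity can exceed discrepancy, as here: different competing models may flip different points, so the union of flipped points grows even when no single model disagrees much with the baseline.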