Predictive Multiplicity in Classification
Authors: Charles Marx, Flavio Calmon, Berk Ustun
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we apply our tools to measure predictive multiplicity in recidivism prediction problems. We have three goals: (i) to measure the incidence of predictive multiplicity in real-world classification problems; (ii) to discuss how reporting predictive multiplicity can inform stakeholders; (iii) to show that we can also measure predictive multiplicity using existing tools, albeit imperfectly. ... Datasets. We derive 8 datasets from the following studies of recidivism in the United States ... In Figure 3, we plot ambiguity and discrepancy for all possible values of the error tolerance ε for compas arrest, comparing the measures produced using our tools to those produced using an ad hoc analysis. In Table 2, we compare competing classifiers for compas arrest. |
| Researcher Affiliation | Academia | 1Haverford College 2Harvard SEAS 3UC San Diego. Correspondence to: Charles T. Marx <cmarx@haverford.edu>, Berk Ustun <berk@ucsd.edu>. |
| Pseudocode | Yes | Algorithm 1 Compute Discrepancy for All Values of ε and Algorithm 2 Compute Ambiguity for All Values of ε |
| Open Source Code | Yes | We include software to reproduce our results at https://github.com/charliemarx/pmtools. |
| Open Datasets | Yes | Datasets. We derive 8 datasets from the following studies of recidivism in the United States: compas from Angwin et al. 2016; pretrial from Felony Defendants in Large Urban Counties (US Dept. of Justice, 2014b); recidivism from Recidivism of Prisoners Released in 1994 (US Dept. of Justice, 2014a). ... All datasets are publicly available. We include a copy of compas arrest and compas violent with our code. The remaining datasets must be requested from ICPSR due to privacy restrictions. |
| Dataset Splits | No | No explicit validation split is mentioned for the primary experimental setup. The paper states: 'We split each dataset into a training set composed of 80% of points and a test set composed of 20% of points.' |
| Hardware Specification | Yes | We solve each MIP on a 3.33 GHz CPU with 16 GB RAM. |
| Software Dependencies | No | No version numbers are provided for software dependencies. The paper mentions a 'MIP solver such as CPLEX, CBC, or Gurobi' and the 'glmnet package of Friedman et al. (2010)' without version details. |
| Experiment Setup | Yes | We use a baseline classifier h0 that minimizes the error rate, which we fit using a MIP formulation in Appendix B. ... We allocate at most 6 hours to fit the baseline model, 6 hours to fit the models to compute discrepancy for all ε, and 6 hours to fit the models to compute ambiguity for all ε. ... a margin parameter γ > 0, which should be set to a small positive number (e.g., γ = 10⁻⁴). |
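The two multiplicity measures quoted above can be illustrated with a short sketch. Assuming we already have 0/1 predictions from a baseline model and from a set of competing models in the ε-level set (obtaining those models is the paper's MIP-based step, not shown here), discrepancy is the largest fraction of points on which any single competing model disagrees with the baseline, and ambiguity is the fraction of points flipped by at least one competing model. The function names and toy data below are illustrative, not from the paper's released code.

```python
import numpy as np

def discrepancy(baseline_preds, competing_preds):
    """Max fraction of points on which any one competing model
    disagrees with the baseline classifier."""
    baseline = np.asarray(baseline_preds)
    return max(np.mean(np.asarray(p) != baseline) for p in competing_preds)

def ambiguity(baseline_preds, competing_preds):
    """Fraction of points whose prediction is changed by at least
    one competing model in the set."""
    baseline = np.asarray(baseline_preds)
    flipped = np.zeros(len(baseline), dtype=bool)
    for p in competing_preds:
        flipped |= (np.asarray(p) != baseline)  # mark any point some model flips
    return flipped.mean()

# Toy example: 5 points, baseline plus two competing classifiers.
base = [0, 1, 1, 0, 1]
models = [[0, 1, 0, 0, 1],   # disagrees on point 2
          [1, 1, 1, 0, 1]]   # disagrees on point 0

print(discrepancy(base, models))  # 0.2 (each competitor flips 1 of 5 points)
print(ambiguity(base, models))    # 0.4 (points 0 and 2 flipped by some model)
```

Note that ambiguity can exceed discrepancy, as here: different competing models may flip different points, so the union of flipped points grows even when no single model disagrees much with the baseline.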