Predicting Rare Events by Shrinking Towards Proportional Odds
Authors: Gregory Faletto, Jacob Bien
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Section 4 we demonstrate through synthetic and real data experiments that PRESTO can outperform both logistic regression on the rare class and the proportional odds model, both in settings where the differences in adjacent βk vectors are sparse, as PRESTO assumes, and in settings where these differences are not sparse. ... 4. Experiments: To illustrate the efficacy of PRESTO, we conduct two synthetic experiments and also examine two real data sets. |
| Researcher Affiliation | Academia | Gregory Faletto¹, Jacob Bien¹. ¹Department of Data Sciences and Operations, University of Southern California, Los Angeles, CA, USA. |
| Pseudocode | No | The paper describes mathematical formulations and discusses implementation details, referring to modifications of existing packages, but it does not include a structured pseudocode block or algorithm listing. |
| Open Source Code | Yes | The code generating all plots and tables is available at https://github.com/gregfaletto/presto. |
| Open Datasets | Yes | We conduct a real data experiment using the soup data set from the R ordinal package (R. H. B. Christensen, 2019). ... We present another real data experiment using the data set PreDiabetes from the MLDataR R package (Hutson et al., 2022). |
| Dataset Splits | Yes | For PRESTO, we use 5-fold cross-validation to choose a value of λn among 20 choices, selecting the λn with the best out-of-fold Brier score (other metrics, like negative log likelihood, failed because some values of λn in some folds resulted in models yielding negative probabilities, so these other metrics were undefined). ... First, we randomly split the data into training (90% of the data) and test (10%) sets. |
| Hardware Specification | Yes | The real data experiments from Section 4.3 and Appendix B were conducted in R Version 4.3.0 running on macOS Ventura 13.3.1 on a MacBook Pro with a 2.3 GHz Quad-Core Intel Core i5 processor and 16 GB of RAM. ... The synthetic data experiments from Sections 4.1 and 4.2, as well as Simulation Studies A and B in Appendix E, were conducted in R Version 4.2.2 running on macOS 10.15.7 on an iMac with a 3.5 GHz Quad-Core Intel Core i7 processor and 32 GB of RAM. |
| Software Dependencies | Yes | We used the R packages MASS (Venables & Ripley, 2002, version 7.3.58.1), simulator (Bien, 2016, version 0.2.4), ggplot2 (Wickham, 2016, version 3.3.6), cowplot (Wilke, 2020, version 1.1.1), and stargazer (Hlavac, 2022, version 5.2.3), all available for download on CRAN, as well as the base parallel package (version 4.3.0). |
| Experiment Setup | Yes | We repeat the following procedure for 700 simulations. First we generate data using n = 2500, p = 10, and K = 4. We draw a random X ∈ [−1, 1]^(n×p), where X_ij ∼ Uniform(−1, 1) for all i ∈ {1, ..., n} and j ∈ {1, ..., p}. Then y ∈ {1, ..., K}^n is generated according to a relaxation of the proportional odds model; instead of (1), we generate probabilities according to (3) where the βk are generated in the following way for sparsity settings of η ∈ {1/3, 1/2}: ... We consider three possible sets of intercepts: α = (0, 3, 5), (0, 3.5, 5.5), and (0, 4, 6)... For PRESTO, we use 5-fold cross-validation to choose a value of λn among 20 choices, selecting the λn with the best out-of-fold Brier score... |
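
The tuning procedure quoted in the table (a 90%/10% train/test split, then 5-fold cross-validation over 20 candidate values of λn scored by out-of-fold Brier score) can be sketched as follows. This is a minimal illustration in Python, not the authors' R code. The model `shrunken_rate`, the toy label generator with a 2% event rate, and the log-spaced λ grid are all hypothetical stand-ins chosen so the example runs on its own; PRESTO itself is not implemented here. The sketch also shows why the Brier score remains usable when a fitted model returns a value outside [0, 1]: it is a squared difference and is always defined, whereas a log likelihood is not.

```python
import random


def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(outcomes)


def kfold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def select_lambda(y, lambdas, fit_predict, k=5, seed=0):
    """Return (best_lambda, scores); each candidate is scored out-of-fold."""
    rng = random.Random(seed)
    folds = kfold_indices(len(y), k, rng)
    scores = []
    for lam in lambdas:
        probs, outcomes = [], []
        for fold in folds:
            held_out = set(fold)
            train_y = [y[i] for i in range(len(y)) if i not in held_out]
            p = fit_predict(train_y, lam)  # fit on the other k-1 folds
            probs.extend([p] * len(fold))  # predict the held-out fold
            outcomes.extend(y[i] for i in fold)
        scores.append(brier_score(probs, outcomes))
    best = min(range(len(lambdas)), key=lambda j: scores[j])
    return lambdas[best], scores


def shrunken_rate(train_y, lam):
    """HYPOTHETICAL stand-in model (not PRESTO): event rate shrunk toward 0.5."""
    return (sum(train_y) + 0.5 * lam) / (len(train_y) + lam)


rng = random.Random(1)
n = 2500
# Toy rare-event labels; the 2% rate is an assumption for illustration only.
y = [1 if rng.random() < 0.02 else 0 for _ in range(n)]

# 90% / 10% train/test split.
order = list(range(n))
rng.shuffle(order)
n_train = int(0.9 * n)
train_y = [y[i] for i in order[:n_train]]
test_y = [y[i] for i in order[n_train:]]

# 20 candidate penalties; the log-spaced grid itself is an assumption.
lambdas = [10 ** (-2 + 4 * j / 19) for j in range(20)]
best_lambda, cv_scores = select_lambda(train_y, lambdas, shrunken_rate)

# Score the selected candidate on the held-out 10%.
test_pred = shrunken_rate(train_y, best_lambda)
test_brier = brier_score([test_pred] * len(test_y), test_y)
```

Any model that maps training data and a penalty value to predictions could be passed in place of `shrunken_rate`; the selection loop does not depend on how the model is fit.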