Recovering the Propensity Score from Biased Positive Unlabeled Data
Authors: Walter Gerych, Thomas Hartvigsen, Luke Buquicchio, Emmanuel Agu, Elke Rundensteiner
AAAI 2022, pp. 6694-6702 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical study shows that our approach significantly outperforms the state-of-the-art propensity estimation methods on a rich variety of benchmark datasets. Through a series of extensive experiments we show that our models outperform the state-of-the-art methods by estimating propensity scores more accurately and subsequently making more accurate classifications. |
| Researcher Affiliation | Academia | 1 Worcester Polytechnic Institute, 2 MIT CSAIL |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states "We use the publicly-available code for TIcE and SAR-EM and implement Cluster ourselves." It does not provide concrete access to the source code for the methodology described in this paper. |
| Open Datasets | Yes | We use several standard benchmark datasets from the UCI Machine Learning Repository (Dua and Graff 2017): Yeast (Horton and Nakai 1996), Bank (Dua and Graff 2017), Wine (Aeberhard, Coomans, and De Vel 1994), HTRU2 (Lyon et al. 2016), Occupancy (Candanedo and Feldheim 2016), and Adults (Kohavi 1996). We likewise use two real-world datasets: Yelp Reviews (Zhang, Zhao, and LeCun 2015) and PASCAL VOC 2007 (Everingham et al. 2007). |
| Dataset Splits | Yes | Random 70/30 train/test splits were used for each dataset. |
| Hardware Specification | No | Results in this paper were obtained in part using a high-performance computing system acquired through NSF MRI grant DMS-1337943 to WPI. |
| Software Dependencies | No | We use a Gaussian Process Classifier (Rasmussen 2003) to model the label indicator posterior necessary for each method. The paper does not provide specific version numbers for software dependencies. (A sketch of this posterior-modeling setup appears after the table.) |
| Experiment Setup | Yes | Random 70/30 train/test splits were used for each dataset. Each experiment was repeated 10 times in order to obtain confidence intervals. Specifically, we cluster the distances of the positive instances from the mean into 20 bins or clusters using k-means (such that the clustering is on the distances, not the positions of the points in the feature space). Each bin is assigned a random propensity score between 0.1 and 0.9. Ten trials are run per dataset and bins are randomly assigned for each. This is achieved by applying Borderline-SMOTE (Han, Wang, and Mao 2005) to generate samples along the boundary of the positive and negative classes, such that negative samples were generated on the positive side and vice versa. We apply this and ensure a roughly 30% class overlap for each dataset. The ground truth propensity score in this setting is determined by first training a probabilistic classifier (a logistic regression model) to find the posterior of the positive class. Then, the propensity score was determined as the posterior multiplied by a constant k, where k was randomly sampled from 0.3 to 0.8. k was re-sampled for each run, for ten runs per dataset. (Sketches of both labeling mechanisms appear after the table.) |
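
The label-indicator posterior modeling noted in the Software Dependencies row can be illustrated with a minimal sketch, assuming scikit-learn's `GaussianProcessClassifier` and the random 70/30 train/test split repeated over 10 trials described above. The function name and toy data here are illustrative, not the authors' code.

```python
# Sketch: model the label-indicator posterior P(s=1 | x) with a GP classifier,
# using random 70/30 train/test splits repeated over 10 trials.
# Assumes scikit-learn; the data below is a synthetic stand-in for a benchmark dataset.
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.model_selection import train_test_split

def label_indicator_posterior(X, s, n_trials=10, test_size=0.3, seed=0):
    """Fit P(s=1 | x) per trial and return the test-set posteriors."""
    posteriors = []
    for trial in range(n_trials):
        X_tr, X_te, s_tr, s_te = train_test_split(
            X, s, test_size=test_size, random_state=seed + trial)
        gpc = GaussianProcessClassifier(random_state=seed + trial)
        gpc.fit(X_tr, s_tr)
        # Column 1 is the probability that an instance is labeled (s = 1).
        posteriors.append(gpc.predict_proba(X_te)[:, 1])
    return posteriors

# Toy usage with synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
s = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)
post = label_indicator_posterior(X, s)
print(len(post), post[0][:5])
```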
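
The bin-based labeling mechanism from the Experiment Setup row can be simulated roughly as follows. This is a sketch under the stated assumptions (k-means over distances to the positive-class mean, 20 bins, per-bin propensities drawn uniformly from [0.1, 0.9]); it is not the authors' implementation, and the helper name and toy data are hypothetical.

```python
# Sketch: assign propensity scores to positive instances by binning their
# distances from the positive-class mean and giving each bin a random score.
import numpy as np
from sklearn.cluster import KMeans

def bin_based_propensity(X_pos, n_bins=20, low=0.1, high=0.9, seed=0):
    """Return per-instance propensities e(x) and sampled label indicators s."""
    rng = np.random.default_rng(seed)
    # Distance of each positive instance from the positive-class mean.
    dist = np.linalg.norm(X_pos - X_pos.mean(axis=0), axis=1)
    # Cluster the 1-D distances (not the feature vectors) into n_bins bins.
    bins = KMeans(n_clusters=n_bins, n_init=10, random_state=seed).fit_predict(
        dist.reshape(-1, 1))
    # One random propensity score per bin, then Bernoulli-sample the labels.
    bin_scores = rng.uniform(low, high, size=n_bins)
    e = bin_scores[bins]          # propensity e(x) per instance
    s = rng.binomial(1, e)        # observed label indicator
    return e, s

# Toy usage: 200 synthetic positive instances.
rng = np.random.default_rng(1)
X_pos = rng.normal(size=(200, 5))
e, s = bin_based_propensity(X_pos)
print(e[:5], s[:5])
```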
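
Likewise, the posterior-times-constant ground-truth propensity described at the end of the Experiment Setup row might look roughly like the sketch below; the logistic-regression posterior and the range of k follow the quoted description, while the Borderline-SMOTE class-overlap step is only mentioned, not reproduced.

```python
# Sketch: ground-truth propensity e(x) = k * P(y=1 | x), with k drawn per run.
import numpy as np
from sklearn.linear_model import LogisticRegression

def posterior_scaled_propensity(X, y, k_low=0.3, k_high=0.8, seed=0):
    """Return per-instance propensities, sampled label indicators, and k."""
    rng = np.random.default_rng(seed)
    # Probabilistic classifier (logistic regression) for the positive-class posterior.
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    posterior = clf.predict_proba(X)[:, 1]
    k = rng.uniform(k_low, k_high)            # re-sampled for each run
    e = k * posterior
    # Positives are labeled with probability e(x); negatives are never labeled.
    s = np.where(y == 1, rng.binomial(1, e), 0)
    return e, s, k

# Toy usage on synthetic data (the paper applies this after adding ~30% class
# overlap with Borderline-SMOTE, which is omitted here).
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)
e, s, k = posterior_scaled_propensity(X, y)
print(k, e[:5], s[:5])
```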