Recovering the Propensity Score from Biased Positive Unlabeled Data

Authors: Walter Gerych, Thomas Hartvigsen, Luke Buquicchio, Emmanuel Agu, Elke Rundensteiner

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical study shows that our approach significantly outperforms the state-of-the-art propensity estimation methods on a rich variety of benchmark datasets. Through a series of extensive experiments we show that our models outperform the state-of-the-art methods by estimating propensity scores more accurately and subsequently making more accurate classifications.
Researcher Affiliation | Academia | 1 Worcester Polytechnic Institute; 2 MIT CSAIL
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper states "We use the publicly-available code for TIcE and SAR-EM and implement Cluster ourselves." It does not provide concrete access to the source code for the methodology described in this paper.
Open Datasets | Yes | We use several standard benchmark datasets from the UCI Machine Learning Repository (Dua and Graff 2017): Yeast (Horton and Nakai 1996), Bank (Dua and Graff 2017), Wine (Aeberhard, Coomans, and De Vel 1994), HTRU2 (Lyon et al. 2016), Occupancy (Candanedo and Feldheim 2016), and Adults (Kohavi 1996). We likewise use two real-world datasets: Yelp Reviews (Zhang, Zhao, and LeCun 2015) and PASCAL VOC 2007 (Everingham et al. 2007).
Dataset Splits | Yes | Random 70/30 train/test splits were used for each dataset.
Hardware Specification | No | Results in this paper were obtained in part using a high-performance computing system acquired through NSF MRI grant DMS-1337943 to WPI. The acknowledgment credits an HPC system but does not describe its actual hardware configuration.
Software Dependencies | No | We use a Gaussian Process Classifier (Rasmussen 2003) to model the label indicator posterior necessary for each method (a minimal sketch of this step follows the table). The paper does not provide specific version numbers for software dependencies.
Experiment Setup | Yes | Random 70/30 train/test splits were used for each dataset. Each experiment was repeated 10 times in order to obtain confidence intervals. Specifically, we cluster the distances of the positive instances from the mean into 20 bins or clusters using k-means (such that the clustering is on the distances, not on the positions of the points in the feature space). Each bin is assigned a random propensity score between 0.1 and 0.9. Ten trials are run per dataset and bins are randomly assigned for each. This is achieved by applying Borderline SMOTE (Han, Wang, and Mao 2005) to generate samples along the boundary of the positive and negative classes, such that negative samples were generated on the positive side and vice versa. We apply this and ensure a roughly 30% class overlap for each dataset. The ground truth propensity score in this setting is determined by first training a probabilistic classifier (a logistic regression model) to find the posterior of the positive class. Then, the propensity score was determined as the posterior multiplied by a constant k, where k was randomly sampled from 0.3 to 0.8. k was re-sampled for each run, for ten runs per dataset.
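The Software Dependencies row above mentions a Gaussian Process Classifier used to model the label-indicator posterior p(s = 1 | x). The paper names the model but not the library, kernel, or hyperparameters, so the following is a minimal sketch assuming scikit-learn's GaussianProcessClassifier with a default RBF kernel:

import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

def fit_label_indicator_posterior(X, s):
    """X: feature matrix; s: 1 if an instance is labeled positive, 0 otherwise.

    Returns a callable that estimates p(s=1 | x) for new instances.
    """
    gpc = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0),
                                    random_state=0)
    gpc.fit(X, s)
    # Column for class 1 is the estimated label-indicator posterior p(s=1 | x).
    return lambda X_new: gpc.predict_proba(X_new)[:, 1]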
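The Experiment Setup row describes two ways ground-truth propensity scores are assigned to positive instances before labels are sampled: binning the positives by distance from their mean, and scaling a logistic-regression posterior by a random constant k. The sketch below is an illustrative reconstruction of those descriptions, not the authors' code; function names such as bin_based_propensity are hypothetical, and it assumes NumPy and scikit-learn (the Borderline-SMOTE class-overlap step, available as imblearn.over_sampling.BorderlineSMOTE, is omitted for brevity):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def bin_based_propensity(X, y, n_bins=20):
    """Cluster the positives' distances from the positive-class mean into bins,
    then give every member of a bin the same randomly drawn propensity.
    X: features (np.ndarray); y: true (hidden) class labels, 1 = positive."""
    X_pos = X[y == 1]
    dists = np.linalg.norm(X_pos - X_pos.mean(axis=0), axis=1)
    # k-means on the scalar distances, not on the points themselves
    bins = KMeans(n_clusters=n_bins, n_init=10, random_state=0) \
        .fit_predict(dists.reshape(-1, 1))
    bin_scores = rng.uniform(0.1, 0.9, size=n_bins)  # one propensity per bin
    return bin_scores[bins]                          # e(x) for each positive

def posterior_times_k_propensity(X, y):
    """Propensity = k * p(y=1 | x), with k drawn uniformly from [0.3, 0.8]."""
    posterior = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
    k = rng.uniform(0.3, 0.8)
    return k * posterior[y == 1]

def observe_labels(propensity):
    """Flip a biased coin per positive instance: labeled with probability e(x)."""
    return (rng.uniform(size=propensity.shape[0]) < propensity).astype(int)

Under either construction, positive instances whose sampled label is 1 form the labeled set; the remaining positives and all negatives form the unlabeled set, and the propensity scores drawn here serve as the ground truth against which the estimators are evaluated.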