Leveraging Structure for Improved Classification of Grouped Biased Data

Authors: Daniel Zeiberg, Shantanu Jain, Predrag Radivojac

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on synthetic and real data demonstrate the efficacy of our algorithm over suitable baselines and ablative models, spanning standard supervised and semi-supervised learning approaches, with and without incorporating the group directly as a feature.
Researcher Affiliation | Academia | Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, U.S.A.
Pseudocode | No | The algorithm is given by the following steps (a runnable sketch appears after this table). 1. Cluster: apply k-means clustering to L ∪ U, the combined pool of labeled and unlabeled data, to partition X into K clusters {X_k}, k = 1, …, K; use the silhouette coefficient (de Amorim and Hennig 2015) to determine K. 2. Estimate ρ(x) = p(y = 1 | x): estimate the labeled-data posterior by training a probabilistic classifier on L using group-agnostic features only. Note that a separate classifier may be trained on each labeled cluster to estimate ρ(x), since p(y = 1 | x) = p(y = 1 | x, π(x)). 3. Estimate α_k = p(y = 1 | π = k): estimate the proportion of positives in each cluster of L by counting the positives in the cluster and dividing by the size of the cluster. 4. Estimate α^g_k = p(y = 1 | g, π = k): estimate the proportion of positives in each group-cluster pair from U by applying one of the approaches used for domain adaptation under label shift; see the next section of the paper. 5. Estimate ρ(x, g) = p(y = 1 | x, g): estimate the group-aware posterior by applying the formula derived in Theorem 2 to the estimates of ρ(x), α_{π(x)}, and α^g_{π(x)} computed in the previous steps.
Open Source Code | Yes | Code is available at https://github.com/Dzeiberg/leveraging_structure.
Open Datasets | Yes | Three binary classification datasets, generated from the Folktables American Community Survey (ACS) data (Ding et al. 2021), were used for evaluation: Income, Income Poverty Ratio (IPR), and Employment (see the data-loading sketch after the table).
Dataset Splits | Yes | A held-out validation set was constructed by randomly removing 20% of groups of unique examples in L (see the split sketch after the table).
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) were explicitly provided for running the experiments.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.8 or CPLEX 12.4) are mentioned.
Experiment Setup | Yes | A random forest of 500 decision trees with a maximum depth of 10 was fit to each cluster in the labeled training data, splitting on the Gini criterion (see the final snippet after the table).
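To make the five quoted steps concrete, below is a minimal Python sketch built on scikit-learn. It is not the authors' implementation (see their repository above): the function names are ours, step 4 is treated as an upstream input, and the step-5 formula is the standard prior-shift posterior correction (Saerens et al. 2002), which we assume corresponds to the paper's Theorem 2.

```python
# Minimal sketch of the five steps quoted in the "Pseudocode" row.
# Assumptions: binary labels in {0, 1}; every cluster contains labeled
# examples of both classes; alpha_gk (step 4) is estimated upstream by
# a label-shift method and passed in by the caller.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import silhouette_score


def fit(X_lab, y_lab, X_unlab, k_range=range(2, 11)):
    # Step 1: cluster L ∪ U with k-means, picking K by silhouette coefficient.
    X_all = np.vstack([X_lab, X_unlab])
    scores = {K: silhouette_score(
                  X_all,
                  KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X_all))
              for K in k_range}
    K = max(scores, key=scores.get)
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X_all)
    pi_lab = km.predict(X_lab)

    # Step 2: per-cluster probabilistic classifiers for rho(x) = p(y=1|x),
    # valid since p(y=1|x) = p(y=1|x, pi(x)).
    clfs = {k: RandomForestClassifier(n_estimators=500, max_depth=10,
                                      criterion="gini")
              .fit(X_lab[pi_lab == k], y_lab[pi_lab == k])
            for k in range(K)}

    # Step 3: alpha_k = p(y=1|pi=k), the fraction of positives per labeled cluster.
    alpha = {k: y_lab[pi_lab == k].mean() for k in range(K)}
    return km, clfs, alpha


def predict(x, alpha_gk, km, clfs, alpha):
    # Step 5: group-aware posterior rho(x, g) = p(y=1|x, g) via a
    # prior-shift correction from alpha_k to alpha_gk (assumed form of
    # the paper's Theorem 2; cf. Saerens et al. 2002).
    k = km.predict(x.reshape(1, -1))[0]
    rho = clfs[k].predict_proba(x.reshape(1, -1))[0, 1]
    num = alpha_gk * rho / alpha[k]
    return num / (num + (1 - alpha_gk) * (1 - rho) / (1 - alpha[k]))
```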
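The three ACS tasks named in the "Open Datasets" row can be materialized with the folktables package (Ding et al. 2021). The survey year and state below are illustrative choices, not necessarily the paper's exact configuration.

```python
# Loading the three ACS prediction tasks: Income, IPR, and Employment.
from folktables import (ACSDataSource, ACSEmployment, ACSIncome,
                        ACSIncomePovertyRatio)

source = ACSDataSource(survey_year="2018", horizon="1-Year", survey="person")
acs = source.get_data(states=["CA"], download=True)

for name, task in [("Income", ACSIncome),
                   ("IPR", ACSIncomePovertyRatio),
                   ("Employment", ACSEmployment)]:
    # Each task yields features, a binary label, and a group attribute.
    features, label, group = task.df_to_numpy(acs)
    print(name, features.shape, label.mean())
```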
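One way to realize the quoted validation split, assuming the intent is to hold out 20% of groups from the labeled pool L (our reading; the authors' exact procedure may differ), is scikit-learn's GroupShuffleSplit. X_lab, y_lab, and group_lab are hypothetical names reused from the sketches above.

```python
# Hold out 20% of the groups in the labeled pool L for validation.
from sklearn.model_selection import GroupShuffleSplit

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, val_idx = next(gss.split(X_lab, y_lab, groups=group_lab))
X_train, y_train = X_lab[train_idx], y_lab[train_idx]
X_val, y_val = X_lab[val_idx], y_lab[val_idx]
```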
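Finally, the classifier configuration quoted in the "Experiment Setup" row maps directly onto scikit-learn's RandomForestClassifier, as used per cluster in the first sketch:

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=500,   # 500 decision trees
                             max_depth=10,       # maximum depth of 10
                             criterion="gini")   # Gini splitting criterion
```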