Leveraging Structure for Improved Classification of Grouped Biased Data

Authors: Daniel Zeiberg, Shantanu Jain, Predrag Radivojac

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on synthetic and real data demonstrate the efficacy of our algorithm over suitable baselines and ablative models, spanning standard supervised and semi-supervised learning approaches, with and without incorporating the group directly as a feature.
Researcher Affiliation | Academia | Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, U.S.A.
Pseudocode | No | The algorithm is given by the following steps (a runnable sketch appears after this table). 1. Cluster: apply k-means clustering to L ∪ U, the combined pool of labeled and unlabeled data, to partition X into K clusters {X_k}, k = 1, …, K; use the silhouette coefficient (de Amorim and Hennig 2015) to determine K. 2. Estimate ρ(x) = p(y = 1 | x): estimate the labeled-data posterior by training a probabilistic classifier on L using group-agnostic features only. Note that a separate classifier may be trained on each labeled cluster to estimate ρ(x), since p(y = 1 | x) = p(y = 1 | x, π(x)). 3. Estimate α_k = p(y = 1 | π = k): estimate the proportion of positives in each cluster of L by counting the positives in the cluster and dividing by the size of the cluster. 4. Estimate α^g_k = p(y = 1 | g, π = k): estimate the proportion of positives in each group-cluster pair from U by applying one of the approaches used for domain adaptation under label shift; see the next section of the paper. 5. Estimate ρ(x, g) = p(y = 1 | x, g): estimate the group-aware posterior by applying the formula derived in Theorem 2 to the estimates of ρ(x), α_{π(x)}, and α^g_{π(x)} computed in the previous steps.
Open Source Code | Yes | Code is available at https://github.com/Dzeiberg/leveraging_structure.
Open Datasets | Yes | Three binary classification datasets, generated from the Folktables American Community Survey (ACS) data (Ding et al. 2021), were used for evaluation: Income, Income Poverty Ratio (IPR), and Employment (see the data-loading sketch after the table).
Dataset Splits | Yes | A held-out validation set was constructed by randomly removing 20% of groups of unique examples in L (see the split sketch after the table).
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) were explicitly provided for running the experiments.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.8 or CPLEX 12.4) are mentioned.
Experiment Setup | Yes | A random forest of 500 decision trees with a maximum depth of 10 was fit to each cluster in the labeled training data, splitting on the Gini criterion (see the final snippet after the table).
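To make the five quoted steps concrete, below is a minimal Python sketch built on scikit-learn. It is not the authors' implementation (see their repository above): the function names are ours, step 4 is treated as an upstream input, and the step-5 formula is the standard prior-shift posterior correction (Saerens et al. 2002), which we assume corresponds to the paper's Theorem 2.

```python
# Minimal sketch of the five steps quoted in the "Pseudocode" row.
# Assumptions: binary labels in {0, 1}; every cluster contains labeled
# examples of both classes; alpha_gk (step 4) is estimated upstream by
# a label-shift method and passed in by the caller.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import silhouette_score


def fit(X_lab, y_lab, X_unlab, k_range=range(2, 11)):
    # Step 1: cluster L ∪ U with k-means, picking K by silhouette coefficient.
    X_all = np.vstack([X_lab, X_unlab])
    scores = {K: silhouette_score(
                  X_all,
                  KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X_all))
              for K in k_range}
    K = max(scores, key=scores.get)
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X_all)
    pi_lab = km.predict(X_lab)

    # Step 2: per-cluster probabilistic classifiers for rho(x) = p(y=1|x),
    # valid since p(y=1|x) = p(y=1|x, pi(x)).
    clfs = {k: RandomForestClassifier(n_estimators=500, max_depth=10,
                                      criterion="gini")
              .fit(X_lab[pi_lab == k], y_lab[pi_lab == k])
            for k in range(K)}

    # Step 3: alpha_k = p(y=1|pi=k), the fraction of positives per labeled cluster.
    alpha = {k: y_lab[pi_lab == k].mean() for k in range(K)}
    return km, clfs, alpha


def predict(x, alpha_gk, km, clfs, alpha):
    # Step 5: group-aware posterior rho(x, g) = p(y=1|x, g) via a
    # prior-shift correction from alpha_k to alpha_gk (assumed form of
    # the paper's Theorem 2; cf. Saerens et al. 2002).
    k = km.predict(x.reshape(1, -1))[0]
    rho = clfs[k].predict_proba(x.reshape(1, -1))[0, 1]
    num = alpha_gk * rho / alpha[k]
    return num / (num + (1 - alpha_gk) * (1 - rho) / (1 - alpha[k]))
```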
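The three ACS tasks named in the "Open Datasets" row can be materialized with the folktables package (Ding et al. 2021). The survey year and state below are illustrative choices, not necessarily the paper's exact configuration.

```python
# Loading the three ACS prediction tasks: Income, IPR, and Employment.
from folktables import (ACSDataSource, ACSEmployment, ACSIncome,
                        ACSIncomePovertyRatio)

source = ACSDataSource(survey_year="2018", horizon="1-Year", survey="person")
acs = source.get_data(states=["CA"], download=True)

for name, task in [("Income", ACSIncome),
                   ("IPR", ACSIncomePovertyRatio),
                   ("Employment", ACSEmployment)]:
    # Each task yields features, a binary label, and a group attribute.
    features, label, group = task.df_to_numpy(acs)
    print(name, features.shape, label.mean())
```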
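One way to realize the quoted validation split, assuming the intent is to hold out 20% of groups from the labeled pool L (our reading; the authors' exact procedure may differ), is scikit-learn's GroupShuffleSplit. X_lab, y_lab, and group_lab are hypothetical names reused from the sketches above.

```python
# Hold out 20% of the groups in the labeled pool L for validation.
from sklearn.model_selection import GroupShuffleSplit

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, val_idx = next(gss.split(X_lab, y_lab, groups=group_lab))
X_train, y_train = X_lab[train_idx], y_lab[train_idx]
X_val, y_val = X_lab[val_idx], y_lab[val_idx]
```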
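Finally, the classifier configuration quoted in the "Experiment Setup" row maps directly onto scikit-learn's RandomForestClassifier, as used per cluster in the first sketch:

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=500,   # 500 decision trees
                             max_depth=10,       # maximum depth of 10
                             criterion="gini")   # Gini splitting criterion
```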