Leveraging Structure for Improved Classification of Grouped Biased Data
Authors: Daniel Zeiberg, Shantanu Jain, Predrag Radivojac
AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on synthetic and real data demonstrate the efficacy of our algorithm over suitable baselines and ablative models, spanning standard supervised and semi-supervised learning approaches, with and without incorporating the group directly as a feature. |
| Researcher Affiliation | Academia | Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, U.S.A. |
| Pseudocode | No | The algorithm is given by the following steps. 1. Cluster: Apply k-means clustering to L ∪ U, the combined pool of labeled and unlabeled data, to partition X into K clusters {X_k}, k = 1, …, K. Use the silhouette coefficient (de Amorim and Hennig 2015) to determine K. 2. Estimate ρ(x) = p(y = 1|x): Estimate the labeled-data posterior by training a probabilistic classifier on L using group-agnostic features only. Note that a separate classifier may be trained on each labeled cluster to estimate ρ(x), since p(y = 1|x) = p(y = 1|x, π(x)). 3. Estimate α_k = p(y = 1|π = k): Estimate the proportion of positives in each cluster in L by counting the positives in the cluster and dividing by the size of the cluster. 4. Estimate α^g_k = p(y = 1|g, π = k): Estimate the proportion of positives in each group-cluster pair from U by applying one of the approaches used for domain adaptation under label shift; see next Section. 5. Estimate ρ(x, g) = p(y = 1|x, g): Estimate the group-aware posterior by applying the formula derived in Theorem 2, using the estimates of ρ(x), α_{π(x)}, and α^g_{π(x)} computed in the previous steps. (Hedged sketches of these steps appear after the table.) |
| Open Source Code | Yes | Code is available at https://github.com/Dzeiberg/leveraging_structure. |
| Open Datasets | Yes | Three binary classification datasets, generated from the Folktables American Community Survey (ACS) data (Ding et al. 2021), were used for evaluation: Income, Income Poverty Ratio (IPR), and Employment. |
| Dataset Splits | Yes | A held-out validation set was constructed by randomly removing 20% of the groups of unique examples in L. (See the split sketch after the table.) |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) were explicitly provided for running the experiments. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., library or solver names with version numbers like Python 3.8, CPLEX 12.4) are mentioned. |
| Experiment Setup | Yes | A random forest of 500 decision trees with a maximum depth of 10 was fit to each cluster in the labeled training data, splitting on the Gini criterion. (This configuration is reused in the pipeline sketch after the table.) |
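
The Pseudocode row quotes the paper's five-step procedure in prose. Below is a minimal sketch of steps 1–3 and 5, assuming scikit-learn and reusing the random-forest configuration from the Experiment Setup row; the function names (`choose_k`, `fit_pipeline`, `group_aware_posterior`) are ours, and the last function assumes Theorem 2 takes the standard prior-shift adjustment form, which the table does not quote.

```python
# Hedged sketch of steps 1-3 and 5 of the algorithm quoted above.
# Assumptions: scikit-learn as the library; the per-cluster classifier
# follows the Experiment Setup row (500 trees, depth 10, Gini); the
# posterior correction assumes Theorem 2 has the standard prior-shift form.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import silhouette_score


def choose_k(X, candidates=range(2, 11)):
    # Step 1 helper: pick K by maximizing the silhouette coefficient.
    def score(k):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        return silhouette_score(X, labels)
    return max(candidates, key=score)


def fit_pipeline(X_lab, y_lab, X_unlab):
    # Step 1: cluster the combined pool L ∪ U.
    X_all = np.vstack([X_lab, X_unlab])
    K = choose_k(X_all)
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X_all)
    pi_lab = km.predict(X_lab)

    clfs, alpha = {}, {}
    for k in range(K):
        mask = pi_lab == k
        # Step 2: per-cluster probabilistic classifier on group-agnostic features.
        clfs[k] = RandomForestClassifier(
            n_estimators=500, max_depth=10, criterion="gini", random_state=0
        ).fit(X_lab[mask], y_lab[mask])
        # Step 3: alpha_k = fraction of positives in the labeled cluster.
        alpha[k] = y_lab[mask].mean()
    return km, clfs, alpha


def group_aware_posterior(rho, alpha_k, alpha_gk):
    # Step 5: adjust the labeled-data posterior rho(x) to the group-specific
    # prior alpha_gk (assumed prior-shift form of Theorem 2, not quoted above).
    num = (alpha_gk / alpha_k) * rho
    den = num + ((1 - alpha_gk) / (1 - alpha_k)) * (1 - rho)
    return num / den
```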
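
Step 4 is left abstract in the quoted algorithm ("one of the approaches used for domain adaptation under label shift"). As one concrete, well-known option, this sketch estimates α^g_k with the EM prior-adjustment procedure of Saerens, Latinne, and Decaestecker (2002); the function name and interface are ours, not the paper's.

```python
import numpy as np


def estimate_alpha_em(rho_unlab, alpha_source, n_iter=100, tol=1e-6):
    # One label-shift option for step 4: EM re-estimation of the positive
    # prior (Saerens et al., 2002). rho_unlab holds the source classifier's
    # posteriors p(y=1|x) for the unlabeled points of one (group, cluster)
    # pair; alpha_source is alpha_k estimated from the labeled cluster.
    alpha = alpha_source
    for _ in range(n_iter):
        # E-step: reweight each posterior to the current target prior.
        num = (alpha / alpha_source) * rho_unlab
        den = num + ((1 - alpha) / (1 - alpha_source)) * (1 - rho_unlab)
        w = num / den
        # M-step: the target prior is the mean of the corrected posteriors.
        alpha_new = w.mean()
        if abs(alpha_new - alpha) < tol:
            break
        alpha = alpha_new
    return alpha
```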
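
The Dataset Splits row describes holding out 20% of the groups in L for validation. A minimal sketch, assuming scikit-learn's GroupShuffleSplit and hypothetical arrays `X_lab`, `y_lab`, and `group_ids`:

```python
from sklearn.model_selection import GroupShuffleSplit

# Hold out 20% of the groups in the labeled pool L as validation data.
# X_lab, y_lab, and group_ids are hypothetical arrays of features, labels,
# and per-example group identifiers; the split keeps whole groups together.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, val_idx = next(splitter.split(X_lab, y_lab, groups=group_ids))
X_train, y_train = X_lab[train_idx], y_lab[train_idx]
X_val, y_val = X_lab[val_idx], y_lab[val_idx]
```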