Coresets for Clustering with Fairness Constraints

Authors: Lingxiao Huang, Shaofeng Jiang, Nisheeth Vishnoi

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Empirically, we assess our approach over the Adult, Bank, Diabetes, and Athlete datasets, and show that the coreset sizes are much smaller than the full dataset; applying coresets indeed accelerates the running time of computing the fair clustering objective while ensuring that the resulting objective difference is small. We also achieve a speed-up to recent fair clustering algorithms [6, 7] by incorporating our coreset construction."
Researcher Affiliation | Academia | Lingxiao Huang (Yale University, USA); Shaofeng H.-C. Jiang (Weizmann Institute of Science, Israel); Nisheeth K. Vishnoi (Yale University, USA)
Pseudocode | Yes | Algorithm 1: Fair-Median-1D(X, k)
Input: X = {x1, . . . , xn} ⊂ R^d lying on the real line with x1 ≤ . . . ≤ xn, an integer k ∈ [n], and a number OPT equal to the optimal value of k-median clustering.
Output: an ε-coreset S of X together with weights w : S → R≥0.
1. Set a threshold ξ = ε · OPT;
2. Scan the points from x1 to xn and group them into batches greedily: each batch B is a maximal point set satisfying (B) ≤ ξ;
3. Denote by B(X) the collection of all batches. Let S ← ∅; for each batch B ∈ B(X), add one representative point xB ∈ B to S;
4. For each point xB ∈ S, set w(xB) ← |B|;
5. Return (S, w);
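The greedy batching in Algorithm 1 can be sketched in a few lines. This is a hedged illustration, not the authors' implementation: the extraction lost the exact batch measure "(B)", so the sketch assumes it is the batch's extent (max minus min), and it takes the threshold as ξ = ε · OPT directly; the paper's exact scaling may differ.

```python
import numpy as np

def fair_median_1d_coreset(xs, opt, eps):
    """Sketch of Algorithm 1 (Fair-Median-1D): greedily group sorted 1D
    points into maximal batches of extent at most xi, then keep one
    representative per batch, weighted by the batch size."""
    xs = np.sort(np.asarray(xs, dtype=float))
    xi = eps * opt  # threshold; assumed scaling, see lead-in
    reps, weights = [], []
    start = 0
    for i in range(1, len(xs) + 1):
        # Close the current batch when the next point would push its
        # extent past the threshold (or when the points run out).
        if i == len(xs) or xs[i] - xs[start] > xi:
            batch = xs[start:i]
            reps.append(batch[0])       # representative point of the batch
            weights.append(len(batch))  # weight w(x_B) = |B|
            start = i
    return np.array(reps), np.array(weights)
```

Because every input point is assigned to exactly one batch, the weights always sum to n, so (S, w) preserves the total mass of X.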
Open Source Code | Yes | https://github.com/sfjiang1990/Coresets-for-Clustering-with-Fairness-Constraints
Open Datasets | Yes | "Our evaluation is conducted on four datasets: Adult (~50k), Bank (~45k), and Diabetes (~100k) from the UCI Machine Learning Repository [23], and Athlete (~200k) from [1], which are also considered in previous papers [20, 42, 7]."
Dataset Splits | No | "To evaluate the empirical error, we draw 500 independent random samples of (F, C) and report the maximum empirical error among these samples." The paper does not specify how the datasets themselves were split into training, validation, and test sets for model training.
Hardware Specification | Yes | "The experiments are conducted on a 4-core desktop CPU with 64 GB RAM."
Software Dependencies | Yes | "We use CPLEX [32] to solve the ILPs" (with reference [32] stating "version 12 release 6") and "an important step is to compute an approximate (unconstrained) k-means clustering solution on the dataset by employing the sklearn library [39]."
Experiment Setup | Yes | We pick k = 3 (i.e., the number of clusters) throughout our experiments. We define the empirical error as |Kz(S, F, C) / Kz(X, F, C) − 1| (which is the same measure as ε) for some F and C. To evaluate the empirical error, we draw 500 independent random samples of (F, C) and report the maximum empirical error among these samples.