Coresets for Clustering with Fairness Constraints
Authors: Lingxiao Huang, Shaofeng Jiang, Nisheeth Vishnoi
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we assess our approach on the Adult, Bank, Diabetes, and Athlete datasets, and show that the coreset sizes are much smaller than the full datasets; applying coresets indeed speeds up the computation of the fair clustering objective while keeping the resulting objective difference small. We also achieve a speed-up for recent fair clustering algorithms [6, 7] by incorporating our coreset construction. |
| Researcher Affiliation | Academia | Lingxiao Huang Yale University, USA Shaofeng H.-C. Jiang Weizmann Institute of Science, Israel Nisheeth K. Vishnoi Yale University, USA |
| Pseudocode | Yes | Algorithm 1: Fair-Median-1D(X, k). Input: X = {x_1, …, x_n} ⊂ ℝ^d lying on the real line with x_1 ≤ … ≤ x_n, an integer k ∈ [n], and a number OPT equal to the optimal k-median clustering value. Output: an ε-coreset S of X together with weights w : S → ℝ≥0. 1: Set a threshold ξ satisfying ξ = ε · OPT. 2: Consider the points from x_1 to x_n and group them into batches greedily: each batch B is a maximal point set whose total error is at most ξ. 3: Denote by B(X) the collection of all batches, and let S contain one representative point from each batch B ∈ B(X). 4: For each point x ∈ B ∩ S, set w(x) ← \|B\|. 5: Return (S, w). (A runnable sketch of this procedure is given after the table.) |
| Open Source Code | Yes | https://github.com/sfjiang1990/Coresets-for-Clustering-with-Fairness-Constraints |
| Open Datasets | Yes | Our evaluation is conducted on four datasets: Adult (~50k), Bank (~45k) and Diabetes (~100k) from UCI Machine Learning Repository [23], and Athlete (~200k) from [1], which are also considered in previous papers [20, 42, 7]. |
| Dataset Splits | No | To evaluate the empirical error, we draw 500 independent random samples of (F, C) and report the maximum empirical error among these samples. The paper does not specify how the datasets themselves were split into training, validation, and test sets for model training. |
| Hardware Specification | Yes | The experiments are conducted on a 4-core desktop CPU with 64 GB RAM. |
| Software Dependencies | Yes | We use CPLEX [32] (version 12, release 6, per reference [32]) to solve the ILPs; in addition, "an important step is to compute an approximate (unconstrained) k-means clustering solution on the dataset by employing the sklearn library [39]" (a minimal sklearn sketch of this step follows the table). |
| Experiment Setup | Yes | We pick k = 3 (the number of clusters) throughout our experiments. We define the empirical error as \|K_z(S, F, C) / K_z(X, F, C) − 1\| (the same measure as ε) for given F and C. To evaluate the empirical error, we draw 500 independent random samples of (F, C) and report the maximum empirical error among these samples (see the evaluation sketch after the table). |
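
For reference, here is a minimal runnable sketch of Algorithm 1 (Fair-Median-1D) from the Pseudocode row. The batch error measure (total distance of batch points to the batch mean), the choice of the mean as representative, and the exact threshold constant are assumptions where the extracted pseudocode was ambiguous; they are not claimed to match the authors' implementation.

```python
import numpy as np

def fair_median_1d(xs, opt, eps):
    """Sketch of Algorithm 1 (Fair-Median-1D): greedily group sorted 1-D
    points into maximal batches of error at most xi = eps * OPT, then keep
    one weighted representative per batch (weight = batch size).

    Assumptions: batch error = total distance of batch points to the batch
    mean, and the representative is that mean; both are guesses where the
    extracted pseudocode lost its symbols.
    """
    xs = np.sort(np.asarray(xs, dtype=float))
    xi = eps * opt  # threshold from step 1 (constant factors may differ)
    reps, weights = [], []
    batch = []
    for x in xs:
        trial = batch + [x]
        mean = sum(trial) / len(trial)
        if batch and sum(abs(p - mean) for p in trial) > xi:
            # Adding x would exceed the threshold: the current batch is
            # maximal. Emit its representative and start a new batch.
            reps.append(sum(batch) / len(batch))
            weights.append(len(batch))
            batch = [x]
        else:
            batch = trial
    if batch:  # close the final batch
        reps.append(sum(batch) / len(batch))
        weights.append(len(batch))
    return np.array(reps), np.array(weights)
```

The weights sum to n by construction, so an objective evaluated on the weighted pair (S, w) is directly comparable to the same objective on the full dataset X.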
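
The unconstrained k-means step quoted in the Software Dependencies row can be reproduced along these lines (a sketch, not the authors' exact call; the synthetic X stands in for a real dataset, and k = 3 follows the experiment setup):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(1000, 5))  # stand-in for a dataset
# Approximate (unconstrained) k-means solution used as a subroutine;
# n_init restarts Lloyd's algorithm from several seedings.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
centers, labels = kmeans.cluster_centers_, kmeans.labels_
```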
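
Finally, the empirical-error protocol in the Experiment Setup row can be sketched as follows. For brevity this drops the fairness constraints F and evaluates the plain 1-D k-median cost over random center sets C only; the paper's K_z(·, F, C) additionally solves a fair-assignment subproblem. All names here are hypothetical.

```python
import numpy as np

def cost_1d(points, weights, centers, z=1):
    """Weighted clustering cost: each point pays dist(point, nearest center)^z."""
    d = np.abs(points[:, None] - centers[None, :]).min(axis=1) ** z
    return float(np.dot(weights, d))

def empirical_error(xs, reps, weights, k=3, trials=500, seed=0):
    """Max over `trials` random center sets C of |cost(S, C) / cost(X, C) - 1|,
    mirroring the 500-sample maximum reported in the paper (F omitted)."""
    xs = np.asarray(xs, dtype=float)
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(trials):
        centers = rng.choice(xs, size=k, replace=False)  # random center set C
        full = cost_1d(xs, np.ones(len(xs)), centers)
        core = cost_1d(reps, weights, centers)
        worst = max(worst, abs(core / full - 1.0))
    return worst
```

With `reps, weights = fair_median_1d(xs, opt, eps)` from the first sketch, `empirical_error(xs, reps, weights)` should stay below ε, up to the simplifications noted above.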