Coresets for Clustering with Fairness Constraints

Authors: Lingxiao Huang, Shaofeng Jiang, Nisheeth Vishnoi

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Empirically, we assess our approach over the Adult, Bank, Diabetes, and Athlete datasets, and show that the coreset sizes are much smaller than the full dataset; applying coresets indeed accelerates the running time of computing the fair clustering objective while ensuring that the resulting objective difference is small. We also achieve a speed-up to recent fair clustering algorithms [6, 7] by incorporating our coreset construction."
Researcher Affiliation | Academia | Lingxiao Huang (Yale University, USA); Shaofeng H.-C. Jiang (Weizmann Institute of Science, Israel); Nisheeth K. Vishnoi (Yale University, USA)
Pseudocode | Yes | Algorithm 1: Fair-Median-1D(X, k)
Input: X = {x1, . . . , xn} ⊂ R^d lying on the real line with x1 ≤ . . . ≤ xn, an integer k ∈ [n], and a number OPT equal to the optimal value of k-median clustering.
Output: an ε-coreset S of X together with weights w : S → R≥0.
1. Set a threshold ξ = ε · OPT;
2. Scan the points from x1 to xn and group them into batches greedily: each batch B is a maximal point set satisfying (B) ≤ ξ;
3. Denote by B(X) the collection of all batches. Let S ← ∅; for each batch B ∈ B(X), add one representative point xB ∈ B to S;
4. For each point xB ∈ S, set w(xB) ← |B|;
5. Return (S, w);
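The greedy batching in Algorithm 1 can be sketched in a few lines. This is a hedged illustration, not the authors' implementation: the extraction lost the exact batch measure "(B)", so the sketch assumes it is the batch's extent (max minus min), and it takes the threshold as ξ = ε · OPT directly; the paper's exact scaling may differ.

```python
import numpy as np

def fair_median_1d_coreset(xs, opt, eps):
    """Sketch of Algorithm 1 (Fair-Median-1D): greedily group sorted 1D
    points into maximal batches of extent at most xi, then keep one
    representative per batch, weighted by the batch size."""
    xs = np.sort(np.asarray(xs, dtype=float))
    xi = eps * opt  # threshold; assumed scaling, see lead-in
    reps, weights = [], []
    start = 0
    for i in range(1, len(xs) + 1):
        # Close the current batch when the next point would push its
        # extent past the threshold (or when the points run out).
        if i == len(xs) or xs[i] - xs[start] > xi:
            batch = xs[start:i]
            reps.append(batch[0])       # representative point of the batch
            weights.append(len(batch))  # weight w(x_B) = |B|
            start = i
    return np.array(reps), np.array(weights)
```

Because every input point is assigned to exactly one batch, the weights always sum to n, so (S, w) preserves the total mass of X.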
Open Source Code | Yes | https://github.com/sfjiang1990/Coresets-for-Clustering-with-Fairness-Constraints
Open Datasets | Yes | "Our evaluation is conducted on four datasets: Adult (~50k), Bank (~45k), and Diabetes (~100k) from the UCI Machine Learning Repository [23], and Athlete (~200k) from [1], which are also considered in previous papers [20, 42, 7]."
Dataset Splits | No | "To evaluate the empirical error, we draw 500 independent random samples of (F, C) and report the maximum empirical error among these samples." The paper does not specify how the datasets themselves were split into training, validation, and test sets for model training.
Hardware Specification | Yes | "The experiments are conducted on a 4-core desktop CPU with 64 GB RAM."
Software Dependencies | Yes | "We use CPLEX [32] to solve the ILPs" (with reference [32] stating "version 12 release 6") and "an important step is to compute an approximate (unconstrained) k-means clustering solution on the dataset by employing the sklearn library [39]."
Experiment Setup | Yes | We pick k = 3 (i.e., the number of clusters) throughout our experiments. We define the empirical error as |Kz(S, F, C) / Kz(X, F, C) − 1| (which is the same measure as ε) for some F and C. To evaluate the empirical error, we draw 500 independent random samples of (F, C) and report the maximum empirical error among these samples.