Scalable Fair Clustering
Authors: Arturs Backurs, Piotr Indyk, Krzysztof Onak, Baruch Schieber, Ali Vakilian, Tal Wagner
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We complement our theoretical analysis with empirical evaluation. Our experiments show that the quality of the clustering obtained by our algorithm is comparable to that of (Chierichetti et al., 2017). At the same time, the empirical runtime of our algorithm scales almost linearly in the number of points, making it applicable to massive data sets (see Figure 1). |
| Researcher Affiliation | Collaboration | TTIC, Chicago, IL, USA; CSAIL, MIT, Cambridge, MA, USA; IBM T. J. Watson, Yorktown Heights, NY, USA; Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA. |
| Pseudocode | Yes | Algorithm 1 FAIRLETDECOMPOSITION(v, r, b): returns an (r, b)-fairlet decomposition of the points in T(v); Algorithm 2 CLUSTERFAIRLET(Q): returns an (r, b)-fair k-median of P given an (r, b)-fairlet decomposition Q of P; Algorithm 3 MINHEAVYPOINTS({N_r^i, N_b^i}_{i ∈ [γd]}, r, b, γ); Algorithm 4 UNBALANCEDPOINTS(N_r, N_b, r, b): returns the minimum number of points that must be removed so that (N_r, N_b) becomes (r, b)-balanced; Algorithm 5 EXTRAPOINT(c, N_r, N_b, r, b): returns the maximum number of points of color c that can be removed from the set (N_r, N_b) such that it remains (r, b)-balanced; Algorithm 6 NONSATFAIRLET(N_r, N_b, r, b): returns the non-saturated fairlet in a set with (N_r, N_b) points. |
| Open Source Code | Yes | Our code is publicly available at https://github.com/talwagner/fair_clustering. |
| Open Datasets | Yes | The dataset represents 10 years of clinical care at 130 US hospitals and in particular represents the information and outcomes of patients pertaining to diabetes (Strack et al., 2014); The dataset is extracted from marketing campaigns of a Portuguese banking institution (Moro et al., 2014); The dataset contains the records extracted from the 1994 US Census (Kohavi, 1996); The dataset contains the records extracted from the 1990 US Census. |
| Dataset Splits | No | The paper does not explicitly provide details about training, validation, or test splits for its datasets. It mentions sampling for runtime analysis but not for model evaluation. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments (e.g., GPU/CPU models, memory specifications). |
| Software Dependencies | No | The paper mentions running an 'existing K-medoids clustering subroutine' and provides a link to MATLAB's kmedoids function, but it does not specify a version number for MATLAB or any other software dependency. |
| Experiment Setup | Yes | However, instead of building a poly(r, b)-HST, in our implementation we embed the points into a 2-HST. After computing a fairlet decomposition of the points with the given balance parameters, we run an existing K-medoids clustering subroutine.; Fair Clustering Cost (k = 20); In Figure 1 and in both Tables 1 and 3, the reported runtime for each sample size S is the median runtime of our algorithm over 10 different sample sets of size S drawn from the given point set.; Since the Census dataset is not (1, 2)-balanced, we picked a lower balance threshold for this dataset. |
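The pipeline quoted above (fairlet decomposition followed by a standard clustering subroutine) can be illustrated with a minimal sketch of the simplest (1, 1) balance case, where each fairlet pairs one "red" and one "blue" point and the fairlet representatives are then clustered. All function names here are illustrative, not from the paper's code; the paper's near-linear-time algorithm uses an HST embedding and handles general (r, b), which this sketch does not reproduce.

```python
# Hypothetical sketch of the fairlet-decomposition pipeline in the (1, 1) case.
# Assumes 1-D points for simplicity; the paper works in general metric spaces.

def balance(reds: int, blues: int) -> float:
    """Balance of a point set: min(#red/#blue, #blue/#red); 0 if a color is absent."""
    if reds == 0 or blues == 0:
        return 0.0
    return min(reds / blues, blues / reds)

def fairlet_decomposition_1_1(red_points, blue_points):
    """Greedily pair sorted red and blue points into (1, 1)-fairlets."""
    assert len(red_points) == len(blue_points), "(1, 1) balance needs equal counts"
    return list(zip(sorted(red_points), sorted(blue_points)))

def fairlet_center(fairlet):
    """Representative of a fairlet: its 1-D median point."""
    pts = sorted(fairlet)
    return pts[len(pts) // 2]

# Toy example: decompose into fairlets, then cluster the centers
# (here we just extract the centers; a k-median/k-medoids routine would follow).
reds = [0.0, 1.0, 9.0]
blues = [0.5, 1.5, 8.5]
fairlets = fairlet_decomposition_1_1(reds, blues)
centers = [fairlet_center(f) for f in fairlets]
print(balance(len(reds), len(blues)))  # 1.0: the full set is (1, 1)-balanced
print(centers)
```

Because every fairlet is itself (1, 1)-balanced, any clustering of the fairlet centers induces a fair clustering of the original points, which is the structural fact the paper's Algorithm 2 exploits.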