Fair and Fast k-Center Clustering for Data Summarization

Authors: Haris Angelidakis, Adam Kurpisz, Leon Sering, Rico Zenklusen

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Abstract: We consider two key issues faced by many clustering methods when used for data summarization, namely (a) an unfair representation of demographic groups and (b) distorted summarizations, where data points in the summary represent subsets of the original data of vastly different sizes. Previous work made important steps towards handling separately each of these two issues in the context of the fundamental k-Center clustering objective through the study of fast algorithms for natural models that address them. We show that it is possible to effectively address both (a) and (b) simultaneously by presenting a clustering procedure that works for a canonical combined model and (i) is fast, both in theory and practice, (ii) exhibits a worst-case constant-factor guarantee, and (iii) gives promising computational results showing that there can be significant benefits in addressing both issues together instead of sequentially."
Researcher Affiliation | Academia | "Haris Angelidakis *1, Adam Kurpisz *2, Leon Sering *2, Rico Zenklusen *2 [...] 1CoW Protocol. 2Department of Mathematics, ETH Zurich, Zurich, Switzerland. Correspondence to: Adam Kurpisz <adam.kurpisz@ifor.math.ethz.ch>."
Pseudocode | Yes | Algorithm 1: Simplified version of our PRIV-REP-KC algorithm. Algorithm 2: Gonzalez algorithm. Algorithm 3: 2-approximation for Private k-Center. Algorithm 4: rebuildReachability. Algorithm 5: augmentFlow. Algorithm 6: addEdge. Algorithm 7: removeEdge. Algorithm 8: settlePrefix. Algorithm 9: searchForEdge. Algorithm 10: makePrefixesPrivate. Algorithm 11: makePrivate. Algorithm 12: createBackbones. Algorithm 13: computeTauAggregation. Algorithm 14: prepareEdges. Algorithm 15: computeRepresentativeDeltaRealization. Algorithm 16: realizeBatchOfBackbones. Algorithm 17: selectFinalCenters. Algorithm 18: privateAssignmentHeuristic. Algorithm 19: transferPoints. Algorithm 20: PRIV-REP-KC algorithm.
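Of the listed routines, Algorithm 2 (the Gonzalez algorithm) is the classical greedy farthest-point procedure for k-Center; a minimal sketch of that greedy rule follows. The Vec<f64> point representation and the Euclidean metric are assumptions made for illustration only, since, as noted in the Open Source Code row, the paper's own implementation is not available.

```rust
// Minimal sketch of the greedy 2-approximation for k-Center due to Gonzalez
// (Algorithm 2 above). Point type and metric are illustrative assumptions.

fn squared_dist(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Pick an arbitrary first center, then repeatedly add the point farthest
/// from the current set of centers; returns the chosen point indices.
fn gonzalez(points: &[Vec<f64>], k: usize) -> Vec<usize> {
    assert!(!points.is_empty() && k >= 1);
    let mut centers = vec![0]; // arbitrary first center
    // dist[i] = squared distance from point i to its nearest chosen center
    let mut dist: Vec<f64> = points.iter().map(|p| squared_dist(p, &points[0])).collect();
    while centers.len() < k.min(points.len()) {
        // The point farthest from all current centers becomes the next center.
        let (next, _) = dist
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
            .unwrap();
        centers.push(next);
        let c = &points[next];
        for (i, p) in points.iter().enumerate() {
            dist[i] = dist[i].min(squared_dist(p, c));
        }
    }
    centers
}

fn main() {
    let pts = vec![vec![0.0, 0.0], vec![10.0, 0.0], vec![0.0, 9.0], vec![1.0, 1.0]];
    println!("{:?}", gonzalez(&pts, 2)); // prints [0, 1]
}
```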
Open Source Code | No | The paper does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository or mention code in supplementary materials.
Open Datasets | Yes | "We use the following real data sets: The Adult data set (Kohavi and Becker, 1996) contains records about individuals extracted from the 1994 US census. The Diabetes data set (Strack, DeShazo, Gennings, Olmo, Ventura, Cios, and Clore, 2014) includes information from 130 US hospitals over 10 years about patients suffering from diabetes. The Query data set (Anagnostopoulos, 2019) includes synthetic range and radius query workloads derived from Gaussian distributions over the real data set that reports crimes in Chicago. The Electric data set (Hebrail and Berard, 2012) includes 2,049,280 non-corrupted measurements gathered in a house located in Sceaux (7km of Paris, France) between December 2006 and November 2010."
Dataset Splits | No | The paper describes the data sets used but does not specify training, validation, or test splits, whether as percentages, absolute counts, or references to predefined splits.
Hardware Specification | Yes | "On regular notebook hardware equipped with an Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz (8 cores) and 16 GB of RAM, our algorithm for the Diabetes instance with more than 10^5 records and k = 12 runs below 1 second, and for the Electric data set with more than 2x10^6 records and k = 32 the algorithm terminates within 2 minutes."
Software Dependencies | Yes | "In the actual implementation we use the function select_nth_unstable (version 1.49.0) of the Rust standard library, which is based on the quickselect portion of the pattern-defeating quicksort by Peters (2021)."
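For context, a minimal usage sketch of this standard-library quickselect. The distances vector is an illustrative placeholder, not the paper's data structure; and because f64 is not Ord, the sketch uses the select_nth_unstable_by variant (stabilized alongside select_nth_unstable in Rust 1.49.0) with an explicit comparator.

```rust
// Minimal sketch of the standard-library quickselect mentioned above.
// The data is a placeholder; f64 is not Ord, hence the _by variant.

fn main() {
    let mut distances = vec![4.0_f64, 1.5, 9.2, 0.3, 7.7, 2.8];
    let k = 2; // we want the element of sorted rank k (0-indexed)

    // After this call, distances[k] holds the (k+1)-smallest value; everything
    // to its left is <= it and everything to its right is >= it, computed in
    // expected linear time by pattern-defeating quicksort's quickselect.
    let (_, kth, _) = distances.select_nth_unstable_by(k, |a, b| a.partial_cmp(b).unwrap());
    println!("3rd-smallest distance: {}", kth); // prints 2.8
}
```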
Experiment Setup | Yes | "Because JNN does not allow for setting a lower bound on the number of centers for each color, and in fact always opens the upper bound of centers for each color, we set a_i = b_i for all i ∈ [γ]. The required number of centers per group is set proportional to the size of the group."
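A minimal sketch of the proportional rule quoted above, under the assumption that fractional shares are rounded by largest remainders so the allocation sums to k (the excerpt does not specify the rounding); the function name is hypothetical.

```rust
// Minimal sketch: each group's center count is proportional to its size and
// is used as both the lower bound a_i and the upper bound b_i. The
// largest-remainder rounding step is an assumption, not from the paper.

fn proportional_centers(group_sizes: &[usize], k: usize) -> Vec<usize> {
    let n: usize = group_sizes.iter().sum();
    assert!(n > 0);
    // Floor of each group's proportional share s * k / n.
    let mut alloc: Vec<usize> = group_sizes.iter().map(|&s| s * k / n).collect();
    // Hand the leftover centers to the groups with the largest remainders.
    let mut order: Vec<usize> = (0..group_sizes.len()).collect();
    order.sort_by_key(|&i| std::cmp::Reverse(group_sizes[i] * k % n));
    let missing = k - alloc.iter().sum::<usize>();
    for &i in order.iter().take(missing) {
        alloc[i] += 1;
    }
    alloc // used as both a_i and b_i for each color i
}

fn main() {
    // Three demographic groups of sizes 500, 300, 200 with k = 12 centers.
    println!("{:?}", proportional_centers(&[500, 300, 200], 12)); // prints [6, 4, 2]
}
```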