Fair and Fast k-Center Clustering for Data Summarization
Authors: Haris Angelidakis, Adam Kurpisz, Leon Sering, Rico Zenklusen
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Abstract: We consider two key issues faced by many clustering methods when used for data summarization, namely (a) an unfair representation of demographic groups and (b) distorted summarizations, where data points in the summary represent subsets of the original data of vastly different sizes. Previous work made important steps towards handling separately each of these two issues in the context of the fundamental k-Center clustering objective through the study of fast algorithms for natural models that address them. We show that it is possible to effectively address both (a) and (b) simultaneously by presenting a clustering procedure that works for a canonical combined model and (i) is fast, both in theory and practice, (ii) exhibits a worst-case constant-factor guarantee, and (iii) gives promising computational results showing that there can be significant benefits in addressing both issues together instead of sequentially. |
| Researcher Affiliation | Academia | Haris Angelidakis*¹, Adam Kurpisz*², Leon Sering*², Rico Zenklusen*²; ¹CoW Protocol, ²Department of Mathematics, ETH Zurich, Zurich, Switzerland. Correspondence to: Adam Kurpisz <adam.kurpisz@ifor.math.ethz.ch>. |
| Pseudocode | Yes | Algorithm 1 Simplified version of our PRIV-REP-KC algorithm. Algorithm 2 Gonzalez algorithm. Algorithm 3 2-approximation for Private k-Center. Algorithm 4 rebuild Reachability. Algorithm 5 augment Flow. Algorithm 6 add Edge. Algorithm 7 remove Edge. Algorithm 8 settle Prefix. Algorithm 9 search For Edge. Algorithm 10 make Prefixes Private. Algorithm 11 make Private. Algorithm 12 create Backbones. Algorithm 13 compute Tau Aggregation. Algorithm 14 prepare Edges. Algorithm 15 compute Representative Delta Realization. Algorithm 16 realize Batch Of Backbones. Algorithm 17 select Final Centers. Algorithm 18 private Assignment Heuristic. Algorithm 19 transfer Points. Algorithm 20 PRIV-REP-KC-algorithm. (A hedged sketch of the Gonzalez farthest-first traversal, Algorithm 2, is given after this table.) |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code, nor does it provide links to a code repository or mention code in supplementary materials. |
| Open Datasets | Yes | We use the following real data sets: The Adult data set (Kohavi and Becker (1996)) contains records about individuals extracted from the 1994 US census (Kohavi, 1996). The Diabetes data set (Strack, De Shazo, Gennings, Olmo, Ventura, Cios, and Clore (1996)) includes information from 130 US hospitals over 10 years about patients suffering from diabetes. The Query data set (Anagnostopoulos (2019)) includes synthetic range and radius query workloads derived from Gaussian distributions over the real data set that reports crimes in Chicago. The Electric data set (Hebrail and Berard (2012)) includes 2049280 noncorrupted measurements gathered in a house located in Sceaux (7km of Paris, France) between December 2006 and November 2010. |
| Dataset Splits | No | The paper describes the datasets used but does not explicitly specify training, validation, or test splits with percentages, absolute counts, or references to predefined splits. |
| Hardware Specification | Yes | On regular notebook hardware equipped with an Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz (8 cores) and 16 GB of RAM, our algorithm for the Diabetes instance with more than 10^5 records and k = 12 runs in under 1 second, and for the Electric dataset with more than 2x10^6 records and k = 32 the algorithm terminates within 2 minutes. |
| Software Dependencies | Yes | In the actual implementation we use the function select_nth_unstable (version 1.49.0) of the standard library of Rust, which is based on the quickselect portion of the pattern-defeating quicksort by Peters (2021). (A usage sketch of select_nth_unstable follows the table.) |
| Experiment Setup | Yes | Because JNN does not allow for setting a lower bound on the number of centers for each color, and in fact always opens the upper bound of centers for each color, we set a_i = b_i for all i ∈ [γ]. The required number of centers per group is set proportional to the size of the group. (An illustrative sketch of this proportional allocation is given after the table.) |
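The Pseudocode row lists the Gonzalez algorithm (Algorithm 2), the classic farthest-first traversal that yields a 2-approximation for unconstrained k-Center. The Rust sketch below is an illustrative reconstruction of that textbook routine under an assumed Euclidean metric; the function name `gonzalez` and the data layout are assumptions, and this is not the authors' implementation.

```rust
// Minimal sketch of Gonzalez's farthest-first traversal for k-Center.
// Assumptions for illustration: points are Euclidean vectors, the first
// center is picked arbitrarily, and indices of the chosen centers are returned.
fn gonzalez(points: &[Vec<f64>], k: usize) -> Vec<usize> {
    fn dist(a: &[f64], b: &[f64]) -> f64 {
        a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
    }
    assert!(!points.is_empty() && k >= 1);
    let mut centers = vec![0usize]; // arbitrary first center
    // d[j] = distance from point j to its closest chosen center so far
    let mut d: Vec<f64> = points.iter().map(|p| dist(p, &points[0])).collect();
    while centers.len() < k.min(points.len()) {
        // next center: the point currently farthest from all chosen centers
        let (next, _) = d
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
            .unwrap();
        centers.push(next);
        for (j, p) in points.iter().enumerate() {
            d[j] = d[j].min(dist(p, &points[next]));
        }
    }
    centers
}
```

The radius achieved by the returned centers is at most twice the optimal k-Center radius, which is why this routine is a natural building block and baseline in this setting.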
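The Software Dependencies row quotes the paper's use of `select_nth_unstable` from the Rust standard library (stabilized in Rust 1.49.0). The call places the element of a given rank at that index and partitions the rest of the slice around it in expected linear time, using the quickselect portion of pattern-defeating quicksort. The snippet below is only a usage illustration with made-up data, not code from the paper.

```rust
fn main() {
    // toy distance values, made up for this example
    let mut dists: Vec<u64> = vec![70, 25, 91, 40, 33, 66, 12];
    let k = 3;
    // After the call, dists[k] holds the element of rank k; every element to
    // its left is <= it and every element to its right is >= it.
    let (smaller, kth, larger) = dists.select_nth_unstable(k);
    println!(
        "rank {}: {} ({} elements at or below, {} at or above)",
        k,
        kth,
        smaller.len(),
        larger.len()
    );
}
```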
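The Experiment Setup row states that the comparison with JNN uses a_i = b_i for every color class, with the number of centers per group proportional to the group's size. The sketch below shows one plausible way to turn group sizes and k into such an allocation; the largest-remainder rounding and the function name `proportional_centers` are assumptions made for illustration, since the paper does not spell out its rounding rule.

```rust
// Hypothetical helper: split k centers across color groups proportionally to
// their sizes, rounding with largest remainders so the counts sum to k.
fn proportional_centers(group_sizes: &[usize], k: usize) -> Vec<usize> {
    let n: usize = group_sizes.iter().sum();
    // ideal (fractional) share of the k centers for each group
    let shares: Vec<f64> = group_sizes
        .iter()
        .map(|&s| k as f64 * s as f64 / n as f64)
        .collect();
    let mut alloc: Vec<usize> = shares.iter().map(|&s| s.floor() as usize).collect();
    let mut remaining = k - alloc.iter().sum::<usize>();
    // hand out the leftover centers by largest fractional remainder
    let mut order: Vec<usize> = (0..group_sizes.len()).collect();
    order.sort_by(|&i, &j| {
        (shares[j] - shares[j].floor())
            .partial_cmp(&(shares[i] - shares[i].floor()))
            .unwrap()
    });
    for &i in &order {
        if remaining == 0 {
            break;
        }
        alloc[i] += 1;
        remaining -= 1;
    }
    alloc // used as both the lower bound a_i and the upper bound b_i
}

fn main() {
    // e.g. three demographic groups of sizes 600, 300 and 100, and k = 8
    println!("{:?}", proportional_centers(&[600, 300, 100], 8)); // [5, 2, 1]
}
```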