Near-optimal Coresets for Robust Clustering

Authors: Lingxiao Huang, Shaofeng H.-C. Jiang, Jianing Lou, Xuan Wu

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We implement our coreset construction algorithm and evaluate its empirical performance on various real datasets. We compare it with several baselines and demonstrate the superior performance of our coreset. In addition, we show that our coresets can significantly speed up approximation algorithms for both the (k, m)-ROBUST MEDIAN and (k, m)-ROBUST MEANS problems.
Researcher Affiliation | Collaboration | Lingxiao Huang, State Key Laboratory of Novel Software Technology, Nanjing University (huanglingxiao1990@126.com); Shaofeng H.-C. Jiang, Peking University (shaofeng.jiang@pku.edu.cn); Jianing Lou, Peking University (loujn@pku.edu.cn); Xuan Wu, Huawei TCS Lab (wu3412790@gmail.com)
Pseudocode | Yes | Algorithm 1: Coreset Construction for (k, z, m)-ROBUST CLUSTERING [an illustrative sampling sketch, distinct from the paper's algorithm, appears after the table]
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described.
Open Datasets | Yes | Our experiments are conducted on publicly available clustering datasets; see Table 1 for a summary of specifications and the choice of parameters. For all datasets, we select numerical features to form a vector in R^d for each record. For the larger datasets, particularly Census1990 and Twitter, we subsample them to 10^5 points so that inefficient baselines can still finish in a reasonable amount of time. Unless otherwise specified, we typically set k = 5 for the number of centers. The number of outliers m is determined on a per-dataset basis, by observing the distance distribution of points to a near-optimal center (see Section C for details). All experiments are conducted on a PC with Intel Core i7 CPU and 16 GB memory, and algorithms are implemented using C++ 11. We implement our coreset following Algorithm 1 except for a few modifications, detailed in Section C. [A hypothetical preprocessing sketch follows the table.]

Table 1: Specifications of datasets and the choice of the parameters.

dataset                         size      subsample  dim.  # of outliers m
Adult (Dua & Graff, 2017)       48842     --         6     200
Bank (Moro et al., 2014)        41188     --         10    200
Twitter (Chan et al., 2018)     21040936  10^5       2     500
Census1990 (Meek et al., 1990)  2458285   10^5       68    1500
Dataset Splits | No | The paper mentions subsampling datasets for efficiency but does not provide specific train/validation/test dataset splits with percentages or sample counts.
Hardware Specification | Yes | All experiments are conducted on a PC with Intel Core i7 CPU and 16 GB memory, and algorithms are implemented using C++ 11.
Software Dependencies | Yes | Algorithms are implemented using C++ 11.
Experiment Setup | Yes | Unless otherwise specified, we typically set k = 5 for the number of centers. The number of outliers m is determined on a per-dataset basis, by observing the distance distribution of points to a near-optimal center (see Section C for details). We consider two natural algorithms and run them on top of our coreset for speedup: a Lloyd-style algorithm tailored to (k, m)-ROBUST MEANS (Chawla & Gionis, 2013), seeded by a modified k-MEANS++ for robust clustering (Bhaskara et al., 2019), which we call LL; and a local search algorithm for (k, m)-ROBUST MEDIAN (Friggstad et al., 2019), which we call LS. We note that for LS, we uniformly sample 100 points from the dataset and use them as the only potential centers, since otherwise it takes too long to run on the original dataset (without coresets). [Hedged sketches of both LL and LS follow the table.]
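
The paper's Algorithm 1 is not reproduced in this report. As a rough illustration of the importance-sampling template that coreset constructions of this kind often follow, the C++ sketch below samples points with probability proportional to their distance to a rough center set (plus an additive term) and reweights each sample by its inverse sampling probability. The function names, the sampling scores, and the additive term are all assumptions made for illustration, not the paper's method.

    // Illustrative importance-sampling coreset sketch -- NOT the paper's
    // Algorithm 1. Scores and weights below are generic assumptions chosen
    // only to show the overall shape of such a routine.
    #include <algorithm>
    #include <random>
    #include <vector>

    using Point = std::vector<double>;

    struct WeightedPoint { Point p; double w; };

    static double sqDist(const Point& a, const Point& b) {
        double s = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) { double d = a[i] - b[i]; s += d * d; }
        return s;
    }

    // Draw `size` points with probability proportional to (distance to a rough
    // center set + 1), reweighting each draw by 1 / (size * prob) so that
    // weighted clustering costs match the full data in expectation.
    std::vector<WeightedPoint> sampleCoreset(const std::vector<Point>& pts,
                                             const std::vector<Point>& roughCenters,
                                             std::size_t size, unsigned seed) {
        const std::size_t n = pts.size();
        std::vector<double> score(n);
        double total = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            double best = sqDist(pts[i], roughCenters[0]);
            for (const Point& c : roughCenters) best = std::min(best, sqDist(pts[i], c));
            score[i] = best + 1.0;  // additive term keeps every point sampleable
            total += score[i];
        }
        std::mt19937 rng(seed);
        std::discrete_distribution<std::size_t> pick(score.begin(), score.end());
        std::vector<WeightedPoint> coreset;
        coreset.reserve(size);
        for (std::size_t s = 0; s < size; ++s) {
            const std::size_t i = pick(rng);
            coreset.push_back({pts[i], total / (size * score[i])});
        }
        return coreset;
    }

The inverse-probability weighting is the standard trick that makes the weighted sample an unbiased cost estimator; the paper's actual construction achieves much stronger (worst-case) guarantees than this sketch.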
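The preprocessing described under Open Datasets (uniform subsampling of the larger datasets to 10^5 points, and choosing the number of outliers m from the distance distribution to a near-optimal center) could look roughly like the following. The helpers subsample and chooseNumOutliers and the quantile-based rule for m are hypothetical stand-ins for the actual procedure in Section C.

    // Hypothetical preprocessing sketch: uniform subsampling to 10^5 points and
    // a quantile-style choice of the outlier count m from the distance
    // distribution to a near-optimal center set. The quantile rule is an
    // illustrative assumption, not the actual procedure of Section C.
    #include <algorithm>
    #include <cmath>
    #include <random>
    #include <vector>

    using Point = std::vector<double>;

    static double dist(const Point& a, const Point& b) {
        double s = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) { double d = a[i] - b[i]; s += d * d; }
        return std::sqrt(s);
    }

    // Uniformly subsample large datasets so slow baselines finish in time.
    std::vector<Point> subsample(std::vector<Point> pts, std::size_t target, unsigned seed) {
        if (pts.size() <= target) return pts;
        std::mt19937 rng(seed);
        std::shuffle(pts.begin(), pts.end(), rng);
        pts.resize(target);
        return pts;
    }

    // Choose m as the number of points whose distance to the nearest
    // near-optimal center lies beyond a high quantile (e.g. 0.99).
    std::size_t chooseNumOutliers(const std::vector<Point>& pts,
                                  const std::vector<Point>& centers,
                                  double quantile) {
        std::vector<double> d(pts.size());
        for (std::size_t i = 0; i < pts.size(); ++i) {
            double best = dist(pts[i], centers[0]);
            for (const Point& c : centers) best = std::min(best, dist(pts[i], c));
            d[i] = best;
        }
        std::sort(d.begin(), d.end());
        const std::size_t cut = static_cast<std::size_t>(quantile * d.size());
        return d.size() - cut;  // everything past the quantile counts as an outlier
    }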
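LL is described as a Lloyd-style algorithm for (k, m)-ROBUST MEANS in the spirit of Chawla & Gionis (2013), often referred to as k-means--: assign points to their nearest centers, discard the m farthest points as outliers, and recompute means over the inliers. A minimal sketch of one such iteration follows; the modified k-MEANS++ seeding of Bhaskara et al. (2019) is omitted, and all identifiers are illustrative.

    // Hedged sketch of one iteration of a Lloyd-style algorithm for
    // (k, m)-ROBUST MEANS in the spirit of k-means-- (Chawla & Gionis, 2013).
    // Assumes 0 <= m < n. Seeding is omitted; identifiers are illustrative.
    #include <algorithm>
    #include <numeric>
    #include <vector>

    using Point = std::vector<double>;

    static double sqDist(const Point& a, const Point& b) {
        double s = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) { double d = a[i] - b[i]; s += d * d; }
        return s;
    }

    std::vector<Point> robustLloydStep(const std::vector<Point>& pts,
                                       const std::vector<Point>& centers,
                                       std::size_t m) {
        const std::size_t n = pts.size(), k = centers.size(), d = pts[0].size();
        std::vector<std::size_t> assign(n);
        std::vector<double> distSq(n);
        // Assign every point to its nearest center.
        for (std::size_t i = 0; i < n; ++i) {
            std::size_t bj = 0;
            double best = sqDist(pts[i], centers[0]);
            for (std::size_t j = 1; j < k; ++j) {
                double dj = sqDist(pts[i], centers[j]);
                if (dj < best) { best = dj; bj = j; }
            }
            assign[i] = bj;
            distSq[i] = best;
        }
        // Treat the m farthest points (by distance to their center) as outliers.
        std::vector<std::size_t> order(n);
        std::iota(order.begin(), order.end(), 0);
        std::nth_element(order.begin(), order.end() - m, order.end(),
                         [&](std::size_t a, std::size_t b) { return distSq[a] < distSq[b]; });
        std::vector<bool> outlier(n, false);
        for (std::size_t t = n - m; t < n; ++t) outlier[order[t]] = true;
        // Recompute each center as the mean of its assigned inliers.
        std::vector<Point> next(k, Point(d, 0.0));
        std::vector<std::size_t> cnt(k, 0);
        for (std::size_t i = 0; i < n; ++i) {
            if (outlier[i]) continue;
            for (std::size_t t = 0; t < d; ++t) next[assign[i]][t] += pts[i][t];
            ++cnt[assign[i]];
        }
        for (std::size_t j = 0; j < k; ++j) {
            if (cnt[j] == 0) { next[j] = centers[j]; continue; }  // keep empty clusters fixed
            for (std::size_t t = 0; t < d; ++t) next[j][t] /= cnt[j];
        }
        return next;
    }

In practice this step would be iterated until the centers and the outlier set stabilize, exactly as plain Lloyd iterations are repeated until convergence.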
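LS restricts local search for (k, m)-ROBUST MEDIAN to 100 uniformly sampled candidate centers. A hedged single-swap sketch, using a robust cost that drops the m farthest points, might look as follows; this illustrates the general swap-based scheme, not the algorithm of Friggstad et al. (2019) verbatim.

    // Hedged single-swap local-search sketch for (k, m)-ROBUST MEDIAN over a
    // small candidate set (e.g. 100 uniform samples). Illustrative only.
    #include <algorithm>
    #include <cmath>
    #include <limits>
    #include <numeric>
    #include <vector>

    using Point = std::vector<double>;

    static double dist(const Point& a, const Point& b) {
        double s = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) { double d = a[i] - b[i]; s += d * d; }
        return std::sqrt(s);
    }

    // Robust median cost: sum of distances to the nearest center, dropping the
    // m farthest points (the designated outliers).
    static double robustCost(const std::vector<Point>& pts,
                             const std::vector<Point>& centers, std::size_t m) {
        std::vector<double> d(pts.size());
        for (std::size_t i = 0; i < pts.size(); ++i) {
            double best = std::numeric_limits<double>::max();
            for (const Point& c : centers) best = std::min(best, dist(pts[i], c));
            d[i] = best;
        }
        std::sort(d.begin(), d.end());
        return std::accumulate(d.begin(), d.end() - m, 0.0);
    }

    // Try every (center, candidate) swap; keep any swap that lowers the robust
    // cost, and stop once no single swap improves it.
    std::vector<Point> localSearch(const std::vector<Point>& pts,
                                   std::vector<Point> centers,
                                   const std::vector<Point>& candidates,
                                   std::size_t m) {
        double cur = robustCost(pts, centers, m);
        bool improved = true;
        while (improved) {
            improved = false;
            for (std::size_t j = 0; j < centers.size() && !improved; ++j) {
                for (const Point& c : candidates) {
                    std::vector<Point> trial = centers;
                    trial[j] = c;
                    const double cost = robustCost(pts, trial, m);
                    if (cost < cur) { centers = trial; cur = cost; improved = true; break; }
                }
            }
        }
        return centers;
    }

Restricting the candidate set keeps each sweep at O(k * |candidates|) cost evaluations, which matches the report's note that unrestricted local search is too slow on the full dataset without coresets.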