Near-optimal Coresets for Robust Clustering
Authors: Lingxiao Huang, Shaofeng H.-C. Jiang, Jianing Lou, Xuan Wu
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We implement our coreset construction algorithm and evaluate its empirical performance on various real datasets. We compare it with several baselines and demonstrate the superior performance of our coreset. In addition, we show that our coresets can significantly speed up approximation algorithms for both (k, m)-ROBUST MEDIAN and (k, m)-ROBUST MEANS problems. |
| Researcher Affiliation | Collaboration | Lingxiao Huang, State Key Laboratory of Novel Software Technology, Nanjing University (huanglingxiao1990@126.com); Shaofeng H.-C. Jiang, Peking University (shaofeng.jiang@pku.edu.cn); Jianing Lou, Peking University (loujn@pku.edu.cn); Xuan Wu, Huawei TCS Lab (wu3412790@gmail.com) |
| Pseudocode | Yes | Algorithm 1 Coreset Construction for (k, z, m)-ROBUST CLUSTERING |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | Our experiments are conducted on publicly available clustering datasets; see Table 1 for a summary of specifications and choice of parameters. For all datasets, we select numerical features to form a vector in R^d for each record. For the larger datasets, particularly Census1990 and Twitter, we subsample them to 10^5 points so that inefficient baselines can still finish in a reasonable amount of time. Unless otherwise specified, we typically set k = 5 for the number of centers. The number of outliers m is determined on a per-dataset basis, by observing the distance distribution of points to a near-optimal center (see Section C for details). All experiments are conducted on a PC with Intel Core i7 CPU and 16 GB memory, and algorithms are implemented using C++11. We implement our coreset following Algorithm 1 except for a few modifications, described in Section C. Table 1 (dataset; size; subsample; dimension; # of outliers m): Adult (Dua & Graff, 2017): 48842 points, dim. 6, m = 200; Bank (Moro et al., 2014): 41188 points, dim. 10, m = 200; Twitter (Chan et al., 2018): 21040936 points, subsampled to 10^5, dim. 2, m = 500; Census1990 (Meek et al., 1990): 2458285 points, subsampled to 10^5, dim. 68, m = 1500. |
| Dataset Splits | No | The paper mentions subsampling datasets for efficiency but does not provide specific train/validation/test dataset splits with percentages or sample counts. |
| Hardware Specification | Yes | All experiments are conducted on a PC with Intel Core i7 CPU and 16 GB memory, and algorithms are implemented using C++ 11. |
| Software Dependencies | Yes | algorithms are implemented using C++ 11. |
| Experiment Setup | Yes | Unless otherwise specified, we typically set k = 5 for the number of centers. The number of outliers m is determined on a per-dataset basis, by observing the distance distribution of points to a near-optimal center (see Section C for details). We consider two natural algorithms and run them on top of our coreset for speedup: a Lloyd-style algorithm tailored to (k, m)-ROBUST MEANS (Chawla & Gionis, 2013), seeded by a modified k-MEANS++ for robust clustering (Bhaskara et al., 2019), which we call LL; and a local search algorithm for (k, m)-ROBUST MEDIAN (Friggstad et al., 2019), which we call LS. We note that for LS, we uniformly sample 100 points from the dataset and use them as the only potential centers, since otherwise it takes too long to run on the original dataset (without coresets). |
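The Lloyd-style (k, m)-ROBUST MEANS heuristic quoted in the Experiment Setup row (Chawla & Gionis, 2013, often called k-means--) alternates nearest-center assignment with trimming the m farthest points before recomputing means. Below is a minimal NumPy sketch of that idea, not the authors' C++ implementation: the function name `robust_kmeans` is hypothetical, and it uses plain uniform (or caller-supplied) seeding in place of the modified k-MEANS++ of Bhaskara et al. (2019).

```python
import numpy as np

def robust_kmeans(X, k, m, n_iter=20, init=None, seed=0):
    """Lloyd-style (k, m)-robust means sketch (in the spirit of k-means--).

    Each iteration: assign points to nearest center, mark the m farthest
    points as outliers, recompute each center as the mean of its inliers.
    Illustrative only; seeding here is uniform, not robust k-means++.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    if init is None:
        centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    else:
        centers = np.asarray(init, dtype=float).copy()
    for _ in range(n_iter):
        # squared distance of every point to every center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)
        nearest = d2[np.arange(len(X)), assign]
        # trim: the m farthest points are treated as outliers this round
        order = np.argsort(nearest)
        inliers, outliers = order[: len(X) - m], order[len(X) - m :]
        # recompute centers on inliers only
        for j in range(k):
            pts = X[inliers][assign[inliers] == j]
            if len(pts) > 0:
                centers[j] = pts.mean(axis=0)
    return centers, outliers
```

Running such an algorithm on a coreset rather than the full point set is exactly where the paper reports its speedups: the per-iteration cost scales with the number of (weighted) coreset points instead of the dataset size.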