LSDS++ : Dual Sampling for Accelerated k-means++

Authors: Chenglin Fan, Ping Li, Xiaoyun Li

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments are conducted to justify the benefit of LSDS++ in practice.
Researcher Affiliation Collaboration Chenglin Fan Ping Li, Xiaoyun Li CSE Department, Pennsylvania State University Linked In Ads 201 Old Main, University Park, PA 16802, USA 700 Bellevue Way NE, Bellevue, WA 98004, USA fanchenglin@gmail.com {pinli, xiaoyli}@linkedin.com
Pseudocode Yes Algorithm 1 One step of local search (Arya et al., 2004) [...] Algorithm 5 One step of LSDS++
Open Source Code No The paper does not contain any statements or links indicating that the source code for the described methodology is publicly available.
Open Datasets Yes We conduct our experiments on five benchmark datasets: DNA (Hsu and Lin, 2002): a dataset with 2000 instances of DNA sequences and 180 features. The dataset is available at LIBSVM repository1. RNA (Uzilov et al., 2006): this dataset contains 59535 samples with 8 features. The dataset was used to study the detection of non-coding RNAs (Uzilov et al., 2006). PHY: 50000 data points from particle physics with 78 features. The data were obtained from high-energy collision experiments and used in KDD Cup 20042. BIO: 145751 data samples with 74 features that measure the match between a protein and a native sequence. This dataset was also used in KDD Cup 2004. MNIST (Le Cun et al., 1998): a hand-written digit dataset with 60000 samples and 784 features.
Dataset Splits No The paper does not explicitly provide details about training, validation, or test dataset splits. k-means is an unsupervised clustering algorithm, and the evaluation focuses on the overall clustering cost rather than train/validation/test performance in a supervised learning sense.
Hardware Specification No No specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments are mentioned in the paper.
Software Dependencies No The paper does not specify any software dependencies with version numbers (e.g., Python, PyTorch, or specific libraries).
Experiment Setup Yes In our experiments, following the setting in Lattanzi and Sohler (2019), we first run k-means++ to obtain an initial center set C0. Then, starting from C0, we run 500 steps of Local Search++ and LSDS++, respectively. Moreover, we also test the performance of running 20 standard local search steps (Arya et al., 2004) starting from the initial centers found by the above algorithms, respectively. We test the number of clusters k {3, 5, 10, 20, 30, 50}. All the presented results are averaged over 10 independent runs.