Simple, Scalable and Effective Clustering via One-Dimensional Projections

Authors: Moses Charikar, Monika Henzinger, Lunjia Hu, Maximilian Vötsch, Erik Waingarten

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we show experimentally that our clustering algorithm gives a new tradeoff between running time and cluster quality compared to previous state-of-the-art methods for these tasks.
Researcher Affiliation | Academia | Moses Charikar (Stanford University, Stanford, CA, moses@cs.stanford.edu); Monika Henzinger (Institute of Science and Technology Austria (ISTA), Klosterneuburg, Austria, monika.henzinger@ist.ac.at); Lunjia Hu (Stanford University, Stanford, CA, lunjia@stanford.edu); Maximilian Vötsch (Faculty of Computer Science, Doctoral School of Computer Science DoCS Vienna, University of Vienna, Vienna, Austria, maximilian.voetsch@univie.ac.at); Erik Waingarten (Department of Computer and Information Sciences, University of Pennsylvania, Philadelphia, PA, ewaingar@seas.upenn.edu)
Pseudocode | Yes | Algorithm 1: Efficient k-means++ seeding in one dimension
Open Source Code | Yes | We provide our code in the supplementary material.
Open Datasets | Yes | KDD [47]: Training data for the 2004 KDD challenge on protein homology. The dataset consists of 145751 observations with 77 real-valued features. Song [12]: Timbre information for 515345 songs with 90 features each, used for year prediction. Census [32]: 1990 US census data with 2458285 observations, each with 68 categorical features.
Dataset Splits | No | The paper uses these datasets for its experiments but does not provide explicit train/validation/test split details, such as specific percentages or sample counts.
Hardware Specification | Yes | All experiments were run on Linux using a notebook with a 3.9 GHz 12th generation Intel Core i7 six-core processor and 32 GiB of RAM.
Software Dependencies | No | All algorithms were implemented in C++, using the blaze library for matrix and vector operations performed on the dataset unless specified differently below. We run a state-of-the-art implementation of Lloyd's k-means algorithm from the scikit-learn library [57]. (Dependencies are named, but no version numbers are given.)
Experiment Setup | Yes | We evaluate various choices of k ({10, 100, 1000}) as well as coresets at various relative sizes, {0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1} times the size of the dataset. We evaluated the algorithms for every k in {10, 25, 50, 100, 250, 500, 1000, 2500, 5000} and z = 2, for solving k-means with the ℓ2-metric. Each algorithm is run 5 times.
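For context on the pseudocode entry above: the paper's Algorithm 1 performs k-means++ seeding on one-dimensional data. The sketch below is a generic, naive implementation of D²-sampling (pick the first center uniformly, then pick each subsequent center with probability proportional to its squared distance to the nearest chosen center), not the paper's efficient algorithm; the function name and O(nk)-per-round approach are illustrative assumptions only.

```python
import bisect
import random


def kmeanspp_1d(points, k, rng=None):
    """Naive k-means++ (D^2) seeding for 1-D data.

    Illustrative sketch only -- the paper's Algorithm 1 achieves the
    same distribution far more efficiently on sorted 1-D input.
    """
    rng = rng or random.Random(0)
    xs = sorted(points)
    centers = [rng.choice(xs)]  # first center: uniform at random
    for _ in range(k - 1):
        # Squared distance of each point to its nearest current center.
        d2 = [min((x - c) ** 2 for c in centers) for x in xs]
        # Sample an index proportional to d2 via prefix sums + binary search.
        prefix, total = [], 0.0
        for v in d2:
            total += v
            prefix.append(total)
        r = rng.random() * total
        centers.append(xs[bisect.bisect_left(prefix, r)])
    return centers
```

Because a point already chosen as a center has squared distance zero, it carries zero sampling weight in later rounds, so (for inputs with at least k distinct values) the seeding tends to spread centers across the data.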