Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Simple, Scalable and Effective Clustering via One-Dimensional Projections

Authors: Moses Charikar, Monika Henzinger, Lunjia Hu, Maximilian Vötsch, Erik Waingarten

NeurIPS 2023 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Finally, we show experimentally that our clustering algorithm gives a new tradeoff between running time and cluster quality compared to previous state-of-the-art methods for these tasks. |
| Researcher Affiliation | Academia | Moses Charikar, Stanford University, Stanford, CA; Monika Henzinger, Institute of Science and Technology Austria (ISTA), Klosterneuburg, Austria; Lunjia Hu, Stanford University, Stanford, CA; Maximilian Vötsch, Faculty of Computer Science, Doctoral School of Computer Science DoCS Vienna, University of Vienna, Vienna, Austria; Erik Waingarten, Department of Computer and Information Sciences, University of Pennsylvania, Philadelphia, PA |
| Pseudocode | Yes (sketch below) | Algorithm 1: Efficient k-means++ seeding in one dimension |
| Open Source Code | Yes | We provide our code in the supplementary material. |
| Open Datasets | Yes | KDD [47]: Training data for the 2004 KDD challenge on protein homology. The dataset consists of 145751 observations with 77 real-valued features. Song [12]: Timbre information for 515345 songs with 90 features each, used for year prediction. Census [32]: 1990 US census data with 2458285 observations, each with 68 categorical features. |
| Dataset Splits | No | The paper mentions using datasets for experiments but does not provide explicit details about train/validation/test dataset splits, such as specific percentages or sample counts. |
| Hardware Specification | Yes | All experiments were run on Linux using a notebook with a 3.9 GHz 12th-generation Intel Core i7 six-core processor and 32 GiB of RAM. |
| Software Dependencies | No | All algorithms were implemented in C++, using the blaze library for matrix and vector operations performed on the dataset unless specified differently below. We run a state-of-the-art implementation of Lloyd's k-means algorithm from the scikit-learn library [57]. |
| Experiment Setup | Yes (sketch below) | We evaluate various choices of k ({10, 100, 1000}) as well as coresets at various relative sizes, {0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1} times the size of the dataset. We evaluated the algorithms for every k in {10, 25, 50, 100, 250, 500, 1000, 2500, 5000} and z = 2, for solving k-means with the ℓ2-metric. Each algorithm is run 5 times. |
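The Pseudocode row above cites the paper's Algorithm 1, an efficient k-means++ seeding procedure for one-dimensional data. For orientation only, the sketch below shows standard D²-sampling (the usual k-means++ seeding rule) specialized to 1D inputs; it is a minimal reference version, not the paper's faster algorithm, and the function name and the naive O(nk) implementation are assumptions made here.

```python
import numpy as np

def kmeanspp_seeding_1d(x, k, seed=None):
    """Reference D^2-sampling (k-means++ seeding) for one-dimensional data.

    Naive O(n*k) version; the paper's Algorithm 1 samples from the same
    distribution more efficiently by exploiting the 1D structure.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    centers = [x[rng.integers(n)]]            # first center: uniform over the points
    d2 = (x - centers[0]) ** 2                # squared distance to the nearest chosen center
    for _ in range(1, k):
        probs = d2 / d2.sum()                 # D^2 distribution over the points
        c = x[rng.choice(n, p=probs)]         # sample the next center
        centers.append(c)
        d2 = np.minimum(d2, (x - c) ** 2)     # update nearest-center distances
    return np.array(centers)
```

A quick usage example: `kmeanspp_seeding_1d(np.random.randn(10000), k=10)` returns 10 seed centers drawn according to the k-means++ rule.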
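The Software Dependencies and Experiment Setup rows describe a scikit-learn Lloyd's k-means baseline and two evaluation grids (values of k, relative coreset sizes, 5 runs each). The sketch below merges these into one illustrative loop, under clearly labeled assumptions: a uniform subsample stands in for the paper's coreset construction, and the function names (`kmeans_cost`, `run_grid`) and the cost evaluation are not from the paper. scikit-learn's `KMeans` runs Lloyd's algorithm with k-means++ initialization.

```python
import itertools
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

KS            = [10, 25, 50, 100, 250, 500, 1000, 2500, 5000]
CORESET_FRACS = [0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1]
N_RUNS        = 5

def kmeans_cost(X, centers):
    # k-means (z = 2) cost of the given centers on the full dataset.
    _, dists = pairwise_distances_argmin_min(X, centers)
    return float((dists ** 2).sum())

def run_grid(X):
    results = {}
    for k, frac in itertools.product(KS, CORESET_FRACS):
        m = max(k, int(frac * len(X)))        # requested coreset size
        costs = []
        for run in range(N_RUNS):
            rng = np.random.default_rng(run)
            # Stand-in for the coreset step: a uniform subsample (not the paper's method).
            sample = X[rng.choice(len(X), size=m, replace=False)]
            # Lloyd's k-means with k-means++ initialization, one restart per run.
            km = KMeans(n_clusters=k, init="k-means++", n_init=1,
                        random_state=run).fit(sample)
            costs.append(kmeans_cost(X, km.cluster_centers_))
        results[(k, frac)] = (np.mean(costs), np.std(costs))
    return results
```

In a real replication the subsampling step would be replaced by the coreset construction evaluated in the paper, and the timing of each run would be recorded alongside the clustering cost.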