Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
Approximate K-Means++ in Sublinear Time
Authors: Olivier Bachem, Mario Lucic, S. Hamed Hassani, Andreas Krause
AAAI 2016 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments conο¬rm that the proposed method is competitive with k-means++ on a variety of real-world, largescale datasets while offering a reduction in runtime of several orders of magnitude. |
| Researcher Affiliation | Academia | Olivier Bachem ETH Zurich EMAIL Mario Lucic ETH Zurich EMAIL S. Hamed Hassani ETH Zurich EMAIL Andreas Krause ETH Zurich EMAIL |
| Pseudocode | Yes | Algorithm 1 K-MC2 |
| Open Source Code | No | The paper does not provide any concrete access information (e.g., specific repository link, explicit code release statement, or code in supplementary materials) for the source code of the methodology. |
| Open Datasets | Yes | We use six different datasets: USGS (United States Geological Survey, 2010), CSN (Faulkner et al., 2011), KDD (KDD Cup, 2004), BIGX (Ackermann et al., 2012), WEB (Yahoo! Labs, 2008) and SONG (Bertin-Mahieux et al., 2011). |
| Dataset Splits | No | For the datasets USGS, CSN and KDD, we set k = 200 and train all methods on the full datasets. For the datasets BIGX, WEB and SONG, we set k = 2000 and train on all but 250 000 points which we use as a holdout set for evaluation. This describes training on full datasets or using a holdout set, but does not explicitly mention distinct 'validation' splits or a detailed splitting methodology for hyperparameter tuning. |
| Hardware Specification | No | The paper does not provide any specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide any specific software details with version numbers (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | For the datasets USGS, CSN and KDD, we set k = 200... For the datasets BIGX, WEB and SONG, we set k = 2000... We run K-MC2 with different chain lengths, i.e. m {1, 2, 5, 10, 20, 50, 100, 150, 200}... We set s {100, 200, 500, . . . , 10 000, 15 000, 20 000}... We use r = 5 rounds and a variety of oversampling factors, i.e. l {0.02k, 0.05k, 0.1k, 0.2k, 0.5k, 1k, 2k}. As all the considered methods are randomized procedures, we run them repeatedly with different initial random seeds. |