Approximate K-Means++ in Sublinear Time
Authors: Olivier Bachem, Mario Lucic, S. Hamed Hassani, Andreas Krause
AAAI 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments confirm that the proposed method is competitive with k-means++ on a variety of real-world, large-scale datasets while offering a reduction in runtime of several orders of magnitude. |
| Researcher Affiliation | Academia | Olivier Bachem (ETH Zurich, olivier.bachem@inf.ethz.ch); Mario Lucic (ETH Zurich, lucic@inf.ethz.ch); S. Hamed Hassani (ETH Zurich, hamed@inf.ethz.ch); Andreas Krause (ETH Zurich, krausea@ethz.ch) |
| Pseudocode | Yes | Algorithm 1 K-MC2 |
| Open Source Code | No | The paper does not provide any concrete access information (e.g., specific repository link, explicit code release statement, or code in supplementary materials) for the source code of the methodology. |
| Open Datasets | Yes | We use six different datasets: USGS (United States Geological Survey, 2010), CSN (Faulkner et al., 2011), KDD (KDD Cup, 2004), BIGX (Ackermann et al., 2012), WEB (Yahoo! Labs, 2008) and SONG (Bertin-Mahieux et al., 2011). |
| Dataset Splits | No | For the datasets USGS, CSN and KDD, we set k = 200 and train all methods on the full datasets. For the datasets BIGX, WEB and SONG, we set k = 2000 and train on all but 250 000 points, which we use as a holdout set for evaluation. The paper describes training on the full datasets or with a holdout set, but it does not specify a distinct validation split or a splitting methodology for hyperparameter tuning. |
| Hardware Specification | No | The paper does not provide any specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide any specific software details with version numbers (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | For the datasets USGS, CSN and KDD, we set k = 200... For the datasets BIGX, WEB and SONG, we set k = 2000... We run K-MC² with different chain lengths, i.e. m ∈ {1, 2, 5, 10, 20, 50, 100, 150, 200}... We set s ∈ {100, 200, 500, ..., 10 000, 15 000, 20 000}... We use r = 5 rounds and a variety of oversampling factors, i.e. l ∈ {0.02k, 0.05k, 0.1k, 0.2k, 0.5k, 1k, 2k}. As all the considered methods are randomized procedures, we run them repeatedly with different initial random seeds. |
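The paper's Algorithm 1 (K-MC²) replaces the exact D²-sampling step of k-means++ seeding with a short Metropolis-Hastings chain over uniform proposals, where m is the chain length swept in the experiments above. Since no open-source code accompanies the paper, the following is only an illustrative NumPy sketch of that seeding scheme, not the authors' implementation; the function name `kmc2_seeding` and its signature are our own.

```python
import numpy as np

def kmc2_seeding(X, k, m, seed=None):
    """Illustrative sketch of K-MC^2 seeding (not the authors' code).

    X: (n, d) data array; k: number of centers; m: Markov chain length.
    Each new center is chosen by an m-step Metropolis-Hastings chain with
    uniform proposals, targeting the k-means++ D^2 distribution."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]  # first center: uniform sample, as in k-means++

    for _ in range(k - 1):
        C = np.asarray(centers)
        x = X[rng.integers(n)]  # chain initialized with a uniform sample
        dx = np.min(np.sum((C - x) ** 2, axis=1))  # squared dist. to nearest center
        for _ in range(m - 1):
            y = X[rng.integers(n)]  # uniform proposal
            dy = np.min(np.sum((C - y) ** 2, axis=1))
            # accept with probability min(1, d(y)^2 / d(x)^2)
            if dx == 0.0 or dy / dx > rng.random():
                x, dx = y, dy
        centers.append(x)
    return np.asarray(centers)
```

Each center thus costs O(m·k·d) distance evaluations instead of a full O(n·d) pass over the data, which is the source of the sublinear seeding time; larger m trades runtime for closer agreement with exact k-means++ sampling.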