Adapting k-means Algorithms for Outliers

Authors: Christoph Grunau, Václav Rozhoň

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We tested the following algorithms on the datasets kdd (KDD Cup 1999) subsampled to 10 000 points with 38 dimensions and spam (Spambase) with 4601 points in 58 dimensions (DG17). ... The results for this setup for k ∈ {5, 10, ..., 50} are in Figs. 1 and 2, each value being an average over 10 runs.
Researcher Affiliation | Academia | ETH Zürich. Correspondence to: Christoph Grunau <cgrunau@ethz.ch>, Václav Rozhoň <rozhonv@ethz.ch>.
Pseudocode | Yes | Algorithm 1 k-means++ seeding; Algorithm 2 k-means++ (over)seeding with penalties; Algorithm 3 One step of Local-search++; Algorithm 4 Local-search++ with outliers; Algorithm 5 Overseeding from Guha et al. (GMM+03); Algorithm 6 k-means|| overseeding.
Open Source Code | No | The paper does not provide a direct link to its source code or explicitly state that the code is being released publicly.
Open Datasets | Yes | We tested the following algorithms on the datasets kdd (KDD Cup 1999) subsampled to 10 000 points with 38 dimensions and spam (Spambase) with 4601 points in 58 dimensions (DG17).
Dataset Splits | No | The paper mentions subsampled dataset sizes but does not specify training, validation, or test splits, or a cross-validation setup.
Hardware Specification | No | The paper does not specify any hardware details (e.g., CPU, GPU models, or memory) used for running the experiments.
Software Dependencies | No | The paper does not list any specific software dependencies with version numbers.
Experiment Setup | Yes | We set the number of outliers z to be 10 percent of the dataset. ... To guess the value of Θ in all except the first two algorithms, we tried 10 values from 1 to 10^10, exponentially separated. The best solution was then picked and we followed by running 10 Lloyd iterations on it with the number of outliers for these iterations set to z (the same for the second k-means++ algorithm). The results for this setup for k ∈ {5, 10, ..., 50} are in Figs. 1 and 2, each value being an average over 10 runs.
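
As a rough illustration of the experiment setup quoted above, the following Python sketch builds an exponentially spaced grid of 10 candidate Θ values and runs Lloyd iterations that set aside the z farthest points before recomputing the centers. It is not the authors' code (none was released). The function names theta_grid, lloyd_with_outliers, and cost_with_outliers are hypothetical, and three details are assumptions made for illustration: the upper bound of the grid is taken to be 10^10, both endpoints are included, and each iteration treats the z points farthest from their nearest center as outliers.

```python
import math


def theta_grid(lo=1.0, hi=1e10, num=10):
    """Return `num` values from lo to hi, evenly spaced on a log scale.

    Assumption: both endpoints are included; the source only says
    "10 values ..., exponentially separated".
    """
    log_lo, log_hi = math.log10(lo), math.log10(hi)
    return [10 ** (log_lo + (log_hi - log_lo) * i / (num - 1)) for i in range(num)]


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd_with_outliers(points, centers, z, iterations=10):
    """Run Lloyd iterations, ignoring the z farthest points in each one.

    Assumption: the z points with the largest distance to their nearest
    center are dropped before the centers are recomputed.
    """
    centers = [list(c) for c in centers]
    for _ in range(iterations):
        # Record (squared distance, nearest center index, point) for every point.
        assigned = []
        for p in points:
            j = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            assigned.append((sq_dist(p, centers[j]), j, p))
        # Keep all but the z farthest points.
        assigned.sort(key=lambda t: t[0])
        inliers = assigned[:len(points) - z]
        # Move each center to the mean of the inliers assigned to it.
        for j in range(len(centers)):
            members = [p for _, cj, p in inliers if cj == j]
            if members:  # an empty cluster keeps its old center
                dim = len(members[0])
                centers[j] = [sum(p[d] for p in members) / len(members)
                              for d in range(dim)]
    return centers


def cost_with_outliers(points, centers, z):
    """Sum of squared distances to the nearest center, excluding the z largest."""
    dists = sorted(min(sq_dist(p, c) for c in centers) for p in points)
    return sum(dists[:len(points) - z])
```

In the reported setup z would be 10 percent of the dataset (for example z = len(points) // 10) and iterations would be 10. One candidate solution per value in theta_grid() could then be compared with cost_with_outliers to pick the best one, although the exact selection criterion is not stated in the quoted text.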