Adapting k-means Algorithms for Outliers
Authors: Christoph Grunau, Václav Rozhoň
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We tested the following algorithms on the datasets kdd (KDD Cup 1999) subsampled to 10 000 points with 38 dimensions and spam (Spambase) with 4601 points in 58 dimensions (DG17). ... The results for this setup for k ∈ {5, 10, ..., 50} are in Figs. 1 and 2, each value being an average over 10 runs. |
| Researcher Affiliation | Academia | ¹ETH Zürich. Correspondence to: Christoph Grunau <cgrunau@ethz.ch>, Václav Rozhoň <rozhonv@ethz.ch>. |
| Pseudocode | Yes | Algorithm 1 k-means++ seeding; Algorithm 2 k-means++ (over)seeding with penalties; Algorithm 3 One step of Local-search++; Algorithm 4 Local-search++ with outliers; Algorithm 5 Overseeding from Guha et al. (GMM+03); Algorithm 6 k-means\|\| overseeding. |
| Open Source Code | No | The paper does not provide a direct link to its source code or explicitly state that the code is being released publicly. |
| Open Datasets | Yes | We tested the following algorithms on the datasets kdd (KDD Cup 1999) subsampled to 10 000 points with 38 dimensions and spam (Spambase) with 4601 points in 58 dimensions (DG17). |
| Dataset Splits | No | The paper mentions subsampled dataset sizes but does not specify training, validation, or test splits, or a cross-validation setup. |
| Hardware Specification | No | The paper does not specify any hardware details (e.g., CPU, GPU models, or memory) used for running the experiments. |
| Software Dependencies | No | The paper does not list any specific software dependencies with version numbers. |
| Experiment Setup | Yes | We set the number of outliers z to be 10 percent of the dataset. ... To guess the value of Θ in all except the first two algorithms, we tried 10 values from 1 to 10^10, exponentially separated. The best solution was then picked and we followed by running 10 Lloyd iterations on it with the number of outliers for these iterations set to z (the same for the second k-means++ algorithm). The results for this setup for k ∈ {5, 10, ..., 50} are in Figs. 1 and 2, each value being an average over 10 runs. |
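
The experiment setup quoted above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code (the paper does not release any). The helper names `theta_grid`, `num_outliers`, and `lloyd_step_with_outliers` are hypothetical. The Lloyd step assumes that "with the number of outliers set to z" means discarding the z points farthest from their nearest center before recomputing the means, and the grid assumes that "exponentially separated" means evenly spaced on a log scale between 1 and 10^10.

```python
import numpy as np


def theta_grid(num=10, low_exp=0, high_exp=10):
    """Candidate values for Theta: `num` values from 10**low_exp to 10**high_exp,
    evenly spaced on a log scale (assumed reading of "exponentially separated")."""
    return np.logspace(low_exp, high_exp, num=num)


def num_outliers(n, fraction_percent=10):
    """Number of outliers z as a percentage of the dataset size (10 percent in the setup)."""
    return (n * fraction_percent) // 100


def lloyd_step_with_outliers(X, centers, z):
    """One Lloyd iteration that ignores the z points farthest from their nearest center.

    Assumption: this trimming rule is one common way to handle outliers in Lloyd's
    algorithm; the paper's exact procedure may differ.
    Returns the updated centers and the inlier cost measured against the input centers.
    """
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Squared Euclidean distance from every point to every center, shape (n, k).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    nearest = d2[np.arange(len(X)), labels]
    # Keep the n - z points with the smallest distance to their nearest center.
    inliers = np.argsort(nearest)[: len(X) - z]
    new_centers = centers.copy()
    for j in range(len(centers)):
        members = inliers[labels[inliers] == j]
        if len(members) > 0:  # leave a center unchanged if its cluster has no inliers
            new_centers[j] = X[members].mean(axis=0)
    cost = nearest[inliers].sum()
    return new_centers, cost


if __name__ == "__main__":
    X = np.array([[0.0], [1.0], [10.0], [11.0], [1000.0]])
    centers = np.array([[0.0], [10.0]])
    for _ in range(10):  # the setup runs 10 Lloyd iterations
        centers, cost = lloyd_step_with_outliers(X, centers, z=1)
    print(theta_grid())
    print(centers, cost)
```

In a full pipeline one would run the seeding algorithm once per value in `theta_grid()`, keep the solution with the lowest cost, and only then apply the Lloyd iterations; the seeding algorithms themselves are not reproduced here.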