Simple and Scalable Sparse k-means Clustering via Feature Ranking

Authors: Zhiyue Zhang, Kenneth Lange, Jason Xu

NeurIPS 2020

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We showcase these contributions thoroughly via simulated experiments and real data benchmarks, including a case study on protein expression in trisomic mice." |
| Researcher Affiliation | Academia | Zhiyue Zhang (Department of Statistical Science, Duke University); Kenneth Lange (Departments of Computational Medicine, Statistics, and Human Genetics, UCLA); Jason Xu (Department of Statistical Science, Duke University). Correspondence to jason.q.xu@duke.edu. |
| Pseudocode | Yes | Algorithm 1: SKFR1 pseudocode; Algorithm 2: SKFR2 pseudocode; Algorithm 3: SKFR permutation tuning pseudocode (a hedged sketch of the feature-ranking step follows the table). |
| Open Source Code | No | The paper does not explicitly state that source code is released, nor does it link to a code repository. |
| Open Datasets | Yes | "To further validate our proposed algorithm, we compare SKFR1 to widely used peer algorithms on 10 benchmark datasets collected from the Keel, ASU, and UCI machine learning repositories. ... a mice protein expression dataset from a study of murine Down Syndrome [49]." |
| Dataset Splits | No | The paper describes simulation setups, the number of trials and restarts, and parameter tuning via the gap statistic, but it does not specify explicit train/validation/test splits (e.g., percentages or sample counts) needed for reproducibility. |
| Hardware Specification | No | The paper mentions a "Julia 1.1 implementation" and reports runtimes, but it gives no details about the hardware (e.g., CPU or GPU models, memory) used for the experiments. |
| Software Dependencies | No | The paper names "Julia 1.1" and several R packages (sparcl, Gmedian, kpodclustr, wskm), but it does not provide version numbers for these R packages, which are key ancillary software components. |
| Experiment Setup | Yes | "In all simulations, the number of informative features s is chosen to be 10, and we explore a range of sparsity levels by varying the total number of features p (20, 50, 100, 200, 500, 1000). The SKFR variant and all the competing algorithms are seeded by the k-means++ initialization scheme [18]. We run 30 trials with 20 restarts per trial. We tune SKM's ℓ1 bound parameter over the range [2, 10] by the gap statistic." (A minimal restart driver sketch also follows the table.) |
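
To make the feature-ranking idea in the Pseudocode row concrete, here is a minimal Python sketch of one SKFR1-style iteration. Since the paper releases no code (see the Open Source Code row), everything below is an assumption: the function name `skfr_step` is hypothetical, and the scoring rule (weighted between-cluster variance per feature) is one plausible reading of "ranking features and keeping the top s", not the authors' exact criterion.

```python
import numpy as np

def skfr_step(X, centers, s):
    """One hypothetical SKFR-style iteration: assign points to the nearest
    center, recompute centers, then keep only the top-s features ranked by
    between-cluster dispersion and reset the rest to the global mean."""
    # Assignment step: nearest center under squared Euclidean distance.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)

    # Update step: per-cluster means (empty clusters keep their old center).
    k, p = centers.shape
    new_centers = centers.copy()
    for j in range(k):
        members = X[labels == j]
        if len(members) > 0:
            new_centers[j] = members.mean(axis=0)

    # Feature ranking: score each feature by the cluster-size-weighted
    # dispersion of its center coordinates around the global mean.
    counts = np.bincount(labels, minlength=k).astype(float)
    global_mean = X.mean(axis=0)
    scores = (counts[:, None] * (new_centers - global_mean) ** 2).sum(axis=0)
    keep = np.argsort(scores)[-s:]  # indices of the s most informative features

    # Sparsify: uninformative features revert to the global mean, so they
    # contribute nothing to cluster separation in the next assignment step.
    sparse_centers = np.tile(global_mean, (k, 1))
    sparse_centers[:, keep] = new_centers[:, keep]
    return labels, sparse_centers, keep
```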
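The Experiment Setup row reports k-means++ seeding with 20 restarts per trial. A hypothetical driver under those settings might look like the sketch below; the name `run_skfr`, the restart-selection criterion (lowest within-cluster sum of squares over the selected features), and the use of scikit-learn's `kmeans_plusplus` initializer are illustrative assumptions, not the paper's implementation, which was written in Julia 1.1.

```python
import numpy as np
from sklearn.cluster import kmeans_plusplus

def run_skfr(X, k, s, n_restarts=20, n_iter=50, seed=0):
    """Hypothetical driver mirroring the reported protocol: seed each restart
    with k-means++ and keep the restart with the lowest within-cluster sum of
    squares over the selected features."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        centers, _ = kmeans_plusplus(
            X, n_clusters=k, random_state=int(rng.integers(2**31)))
        for _ in range(n_iter):
            labels, centers, keep = skfr_step(X, centers, s)
        wcss = ((X[:, keep] - centers[labels][:, keep]) ** 2).sum()
        if best is None or wcss < best[0]:
            best = (wcss, labels, centers, keep)
    return best[1:]  # labels, sparse centers, selected feature indices
```

Iterating `skfr_step` a fixed number of times and keeping the best of 20 restarts mirrors the protocol quoted above, but a faithful reproduction would need the paper's actual stopping rule and tuning procedure (e.g., the permutation tuning of Algorithm 3).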