Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

K-Means Clustering with Distributed Dimensions

Authors: Hu Ding, Yu Liu, Lingxiao Huang, Jian Li

ICML 2016 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results show that our algorithms outperform existing algorithms on real data-sets in the distributed dimension setting.
Researcher Affiliation | Academia | Hu Ding (EMAIL), Computer Science and Engineering, Michigan State University, East Lansing, MI, USA; Yu Liu (EMAIL), Lingxiao Huang (EMAIL), Jian Li (EMAIL), Institute for Interdisciplinary Information Science, Tsinghua University, Beijing, China
Pseudocode | Yes | Algorithm 1 DISTDIM-K-MEANS; Algorithm 2 GRID
Open Source Code | No | The paper does not provide explicit statements or links for open-source code availability.
Open Datasets | Yes | We first choose a real-world data-set from (Bache & Lichman, 2013), Year Prediction MSD, which contains 10^5 points in R^90. ...we also implement our algorithm DISTDIM-K-MEANS on another data-set, Bag of Words (NYTimes), from (Bache & Lichman, 2013)
Dataset Splits | No | The paper does not provide specific details on dataset splits (e.g., percentages, sample counts) for training, validation, or testing, nor does it refer to predefined standard splits for these purposes.
Hardware Specification | No | The paper discusses computation in a distributed setting with 'multiple machines' but does not specify any particular hardware components such as CPU models, GPU models, or memory used for the experiments.
Software Dependencies | No | The paper mentions using algorithms from prior work (Arthur & Vassilvitskii, 2007; Chawla & Gionis, 2013) as centralized subroutines, but it does not name any software with version numbers used for implementation or analysis.
Experiment Setup | Yes | We randomly divide the data-set into 3 parties, with each having 30 attributes (i.e., T = 3), and set k = 5-100. Also, for k-means clustering with outliers, we set the number of outliers z = 500.
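The Experiment Setup row above can be sketched as a small simulation. This is a minimal sketch under assumptions: synthetic Gaussian data stands in for the real data-sets (Year Prediction MSD, Bag of Words NYTimes), the sizes n = 1000 and k = 5 are illustrative, and the clustering routine is a generic centralized Lloyd's k-means baseline, not the paper's DISTDIM-K-MEANS or GRID algorithms.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the paper's setting: n points in R^90.
n, d, T = 1000, 90, 3
X = rng.normal(size=(n, d))

# Randomly divide the d = 90 attributes among T = 3 parties, 30 each,
# mirroring the quoted setup ("3 parties with each having 30 attributes").
perm = rng.permutation(d)
parties = np.array_split(perm, T)

def lloyd_kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's iterations (centralized baseline, not the paper's method)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute centers.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

centers, labels = lloyd_kmeans(X, k=5)
cost = ((X - centers[labels]) ** 2).sum()
```

In the paper's setting each party would see only its own 30-column slice of X; the baseline above instead clusters the full matrix, which is the centralized reference such distributed algorithms are compared against.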