Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
K-Means Clustering with Distributed Dimensions
Authors: Hu Ding, Yu Liu, Lingxiao Huang, Jian Li
ICML 2016 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results show that our algorithms outperform existing algorithms on real data-sets in the distributed dimension setting. |
| Researcher Affiliation | Academia | Hu Ding, Computer Science and Engineering, Michigan State University, East Lansing, MI, USA; Yu Liu, Lingxiao Huang, Jian Li, Institute for Interdisciplinary Information Science, Tsinghua University, Beijing, China |
| Pseudocode | Yes | Algorithm 1 DISTDIM-K-MEANS; Algorithm 2 GRID |
| Open Source Code | No | The paper does not provide explicit statements or links for open-source code availability. |
| Open Datasets | Yes | We first choose a real-world data-set from (Bache & Lichman, 2013), Year Prediction MSD, which contains 10^5 points in R^90. ...we also implement our algorithm DISTDIM-K-MEANS on another data-set, Bag of Words (NYTimes), from (Bache & Lichman, 2013) |
| Dataset Splits | No | The paper does not provide specific details on dataset splits (e.g., percentages, sample counts) for training, validation, or testing, nor does it refer to predefined standard splits for these purposes. |
| Hardware Specification | No | The paper discusses computation in a distributed setting with 'multiple machines' but does not specify any particular hardware components such as CPU models, GPU models, or memory specifications used for the experiments. |
| Software Dependencies | No | The paper mentions using algorithms from specific research papers (Arthur & Vassilvitskii, 2007; Chawla & Gionis, 2013) as centralized subroutines, but it does not specify any software names with version numbers for implementation or analysis. |
| Experiment Setup | Yes | We randomly divide the data-set into 3 parties with each having 30 attributes (i.e., T = 3), and set k = 5-100. Also, for k-means clustering with outliers we set the number of outliers z = 500. |
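The setup quoted above can be sketched as follows. This is an illustrative stand-in, not the authors' implementation: it uses a plain NumPy Lloyd's-iteration k-means in place of the paper's centralized subroutines, random data in place of Year Prediction MSD, and shows a single value of k from the swept range.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    # Plain Lloyd's iterations; an illustrative stand-in for the
    # centralized subroutines cited in the paper.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

# Hypothetical data standing in for the real data-set: n points in R^90.
X = np.random.default_rng(1).normal(size=(300, 90))

# Split the 90 attributes across T = 3 parties, 30 dimensions each,
# mirroring the distributed-dimension setting described in the paper.
T = 3
parties = np.split(X, T, axis=1)  # each party holds an (n, 30) block

# Each party clusters on its own dimensions locally (k = 5 shown here;
# the paper sweeps k from 5 to 100, and sets z = 500 outliers for the
# outlier variant, which this sketch does not implement).
local = [kmeans(p, k=5) for p in parties]
```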