Improved Distributed Principal Component Analysis

Authors: Yingyu Liang, Maria-Florina F. Balcan, Vandana Kanchanapally, David Woodruff

NeurIPS 2014

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical study on real world data shows a speedup of orders of magnitude, preserving communication with only a negligible degradation in solution quality.
Researcher Affiliation | Collaboration | Maria-Florina Balcan, School of Computer Science, Carnegie Mellon University (ninamf@cs.cmu.edu); Vandana Kanchanapally, School of Computer Science, Georgia Institute of Technology (vvandana@gatech.edu); Yingyu Liang, Department of Computer Science, Princeton University (yingyul@cs.princeton.edu); David Woodruff, Almaden Research Center, IBM Research (dpwoodru@us.ibm.com)
Pseudocode | Yes | Algorithm 1: Distributed k-means clustering; Algorithm 2: Fast Distributed PCA for l2-Error Fitting. (A minimal sketch of the core distributed-PCA step appears after this table.)
Open Source Code | No | The paper does not provide any statement or link indicating the release of open-source code for the described methodology.
Open Datasets | Yes | We choose the following real world datasets from the UCI repository [1] for our experiments. For low-rank approximation and k-means clustering, we choose two medium-size datasets, News Groups (18774 × 61188) and MNIST (70000 × 784), and two large-scale Bag-of-Words datasets: NYTimes news articles (BOWnytimes) (300000 × 102660) and PubMed abstracts (BOWpubmed) (8200000 × 141043). We use r = 10 for rank-r approximation and k = 10 for k-means clustering. For PCR, we use MNIST and further choose YearPredictionMSD (515345 × 90), CTslices (53500 × 386), and a large dataset, MNIST8m (800000 × 784).
Dataset Splits | No | The paper mentions the datasets used but does not specify explicit training, validation, or test splits (e.g., percentages or sample counts).
Hardware Specification | No | The paper discusses the number of nodes (servers) in the distributed setting (e.g., "s = 25 for medium-size datasets, and s = 100 for the larger ones") but does not provide specific details about the hardware used for the experiments (e.g., CPU/GPU models, memory).
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers needed to replicate the experiments.
Experiment Setup | Yes | The number of nodes is s = 25 for medium-size datasets, and s = 100 for the larger ones. We distribute the data over the nodes using a weighted partition, where each point is distributed to the nodes with probability proportional to the node's weight, chosen from the power law with parameter α = 2. For each projection dimension, we first construct the projected data using distributed PCA... For each projection dimension and each algorithm with randomness, the average ratio over 5 runs is reported. (A sketch of the weighted partitioning step appears after this table.)
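
The following is a minimal sketch of the two-round distributed PCA idea referenced in the Pseudocode row: each node sends a small local SVD summary to a coordinator, which combines them into global principal components. It is an illustration under stated assumptions, not the paper's Algorithm 2 itself; the function names, the parameters t and r, and the in-memory `nodes` list are hypothetical, and the paper's fast variant additionally applies sketching (subspace embeddings) before the local SVDs, which is omitted here.

    # Hedged sketch of distributed PCA via local SVD summaries (assumed API).
    import numpy as np

    def local_summary(A_i, t):
        """Each node keeps only its top-t right singular directions,
        scaled by the singular values (Sigma_t @ V_t^T)."""
        _, S, Vt = np.linalg.svd(A_i, full_matrices=False)
        return np.diag(S[:t]) @ Vt[:t, :]

    def distributed_pca(nodes, t, r):
        """Coordinator stacks the local summaries and takes the global
        top-r right singular vectors as the principal directions."""
        stacked = np.vstack([local_summary(A_i, t) for A_i in nodes])
        _, _, Vt = np.linalg.svd(stacked, full_matrices=False)
        return Vt[:r, :].T  # d x r matrix of principal directions

    # Toy usage: 4 "nodes", each holding 100 points in 50 dimensions.
    rng = np.random.default_rng(0)
    nodes = [rng.standard_normal((100, 50)) for _ in range(4)]
    V = distributed_pca(nodes, t=20, r=10)
    print(V.shape)  # (50, 10)

Only the s local summaries (t rows each) travel to the coordinator, which is what keeps the communication small relative to shipping the raw data.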
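
The Experiment Setup row describes distributing each point to a node with probability proportional to node weights drawn from a power law with α = 2. Below is a hedged sketch of that partitioning step; the `weighted_partition` helper and the use of a Pareto draw to model the power-law weights are assumptions, since the paper does not specify the exact sampler.

    # Hedged sketch of the weighted data partition across nodes (assumed sampler).
    import numpy as np

    def weighted_partition(A, num_nodes, alpha=2.0, seed=0):
        rng = np.random.default_rng(seed)
        # Heavy-tailed node weights; Pareto(alpha) models "power law with alpha = 2".
        weights = rng.pareto(alpha, size=num_nodes) + 1.0
        probs = weights / weights.sum()
        # Each point goes to a node with probability proportional to that node's weight.
        assignment = rng.choice(num_nodes, size=A.shape[0], p=probs)
        return [A[assignment == i] for i in range(num_nodes)]

    # Toy usage: s = 25 nodes, as for the medium-size datasets.
    A = np.random.default_rng(1).standard_normal((1000, 50))
    parts = weighted_partition(A, num_nodes=25)
    print(sorted(len(p) for p in parts)[-3:])  # a few nodes hold most of the points

The skewed partition matters because communication cost in the protocol is per node, so an uneven split stresses the claim that the summaries stay small regardless of how the data is spread.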