Improved Distributed Principal Component Analysis
Authors: Yingyu Liang, Maria-Florina F Balcan, Vandana Kanchanapally, David Woodruff
NeurIPS 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical study on real world data shows a speedup of orders of magnitude, preserving communication with only a negligible degradation in solution quality. |
| Researcher Affiliation | Collaboration | Maria-Florina Balcan, School of Computer Science, Carnegie Mellon University (ninamf@cs.cmu.edu); Vandana Kanchanapally, School of Computer Science, Georgia Institute of Technology (vvandana@gatech.edu); Yingyu Liang, Department of Computer Science, Princeton University (yingyul@cs.princeton.edu); David Woodruff, Almaden Research Center, IBM Research (dpwoodru@us.ibm.com) |
| Pseudocode | Yes | Algorithm 1 Distributed k-means clustering; Algorithm 2 Fast Distributed PCA for ℓ2-Error Fitting (a hedged sketch of the distributed-PCA pattern these algorithms build on appears after the table) |
| Open Source Code | No | The paper does not provide any statement or link indicating the release of open-source code for the described methodology. |
| Open Datasets | Yes | We choose the following real world datasets from the UCI repository [1] for our experiments. For low rank approximation and k-means clustering, we choose two medium size datasets, News Groups (18774 × 61188) and MNIST (70000 × 784), and two large-scale Bag-of-Words datasets: NYTimes news articles (BOWnytimes) (300000 × 102660) and PubMed abstracts (BOWpubmed) (8200000 × 141043). We use r = 10 for rank-r approximation and k = 10 for k-means clustering. For PCR, we use MNIST and further choose Year Prediction MSD (515345 × 90), CTslices (53500 × 386), and a large dataset MNIST8m (800000 × 784). |
| Dataset Splits | No | The paper mentions datasets used but does not specify explicit training, validation, or test dataset splits (e.g., percentages or sample counts). |
| Hardware Specification | No | The paper discusses the number of nodes (servers) in the distributed setting (e.g., 's = 25 for medium-size datasets, and s = 100 for the larger ones') but does not provide specific details about the hardware used for the experiments (e.g., CPU/GPU models, memory). |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers needed to replicate the experiments. |
| Experiment Setup | Yes | The number of nodes is s = 25 for medium-size datasets, and s = 100 for the larger ones. We distribute the data over the nodes using a weighted partition, where each point is distributed to the nodes with probability proportional to the node's weight, chosen from the power law with parameter α = 2. For each projection dimension, we first construct the projected data using distributed PCA... For each projection dimension and each algorithm with randomness, the average ratio over 5 runs is reported. (A sketch of this partition scheme appears after the table.) |
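
The Pseudocode row above refers to Algorithm 2 (Fast Distributed PCA for ℓ2-Error Fitting), which the paper gives only as pseudocode. The snippet below is a minimal NumPy sketch of the underlying distributed-PCA pattern: each node sends a truncated SVD summary of its local data, and the coordinator stacks the summaries and takes a global SVD. The names `local_summary`, `distributed_pca`, `t1`, and `t2` are illustrative assumptions, and the sketch omits the subspace-embedding step the paper adds to reduce communication, so it is not the authors' exact Algorithm 2.

```python
import numpy as np

def local_summary(P_i, t1):
    """Summarize a node's local n_i x d matrix by its top-t1 right singular
    directions scaled by the singular values (illustrative sketch)."""
    _, S, Vt = np.linalg.svd(P_i, full_matrices=False)
    t1 = min(t1, len(S))
    # Only this t1 x d matrix is communicated to the coordinator.
    return np.diag(S[:t1]) @ Vt[:t1]

def distributed_pca(partitions, t1, t2):
    """Coordinator stacks the local summaries and extracts a global
    t2-dimensional projection (top right singular vectors)."""
    Y = np.vstack([local_summary(P_i, t1) for P_i in partitions])
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    return Vt[:t2].T  # d x t2 projection matrix

# Toy usage: 4 nodes, 100 points each in 20 dimensions, project to rank 5.
rng = np.random.default_rng(0)
parts = [rng.standard_normal((100, 20)) for _ in range(4)]
V = distributed_pca(parts, t1=10, t2=5)
projected = [P @ V for P in parts]  # each node projects its own points locally
```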
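
The Experiment Setup row describes a weighted partition with node weights drawn from a power law with parameter α = 2. The snippet below is a hedged sketch of one way to reproduce that partition; the function name `power_law_partition` and the use of NumPy's Pareto sampler are assumptions on our part, since the paper does not specify how the weights were sampled.

```python
import numpy as np

def power_law_partition(n_points, s, alpha=2.0, seed=0):
    """Assign each of n_points to one of s nodes: node weights are drawn
    from a power-law (Pareto-type) distribution with parameter alpha, and
    each point goes to a node with probability proportional to its weight.
    (Sketch of our reading of the setup; the paper does not give code.)"""
    rng = np.random.default_rng(seed)
    weights = rng.pareto(alpha, size=s) + 1.0   # heavy-tailed node weights
    probs = weights / weights.sum()
    return rng.choice(s, size=n_points, p=probs)  # node index per point

# Example: distribute 70000 MNIST rows over s = 25 nodes.
assignment = power_law_partition(70000, s=25)
node_sizes = np.bincount(assignment, minlength=25)
print(node_sizes)  # node sizes are heavily skewed, as expected under a power law
```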