Communication-efficient Algorithms for Distributed Stochastic Principal Component Analysis
Authors: Dan Garber, Ohad Shamir, Nathan Srebro
ICML 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate some of our theoretical findings we conducted experiments with single-round algorithms on synthetic data. We generated synthetic datasets using two distributions. For both distributions we used the covariance matrix X = UΛU⊤ with U being a random d×d orthonormal matrix and Λ diagonal, satisfying Λ(1,1) = 1, Λ(2,2) = 0.8, ∀j ≥ 3: Λ(j,j) = 0.9·Λ(j−1,j−1), i.e., δ = 0.2. One dataset was generated according to the normal distribution N(0, X), and for the second dataset we generated samples by taking x = √3·X^(1/2)·y where y ~ U[−1,1]^d. In both cases we set d = 300. Beyond the single-round algorithms that are based on aggregating the individual ERM solutions described so far, we propose an additional natural aggregation approach, based on aggregating the individual projection matrices. More concretely, letting {v̂(i)}_{i=1}^{m} denote the leading eigenvectors of the individual machines, let P̄ := (1/m)·Σ_{i=1}^{m} v̂(i)v̂(i)⊤. We then take the final estimate w to be the leading eigenvector of the aggregated matrix P̄. Note that, as with the sign-fixing based aggregation, this approach also resolves the sign-ambiguity in the estimates produced by the different machines, which circumvents the lower bound result of Theorem 3. For both datasets we fixed the number of machines to m = 25. We tested the estimation error (i.e., the value 1 − (w⊤v1)², where v1 is the leading eigenvector of X and w is the estimator) of five benchmarks vs. the per-machine sample size n: the centralized solution v̂1, the average of the individual (unbiased) ERM solutions (normalized to unit norm), the average of ERM solutions with sign-fixing, and the leading eigenvector of the averaged projection matrix. We also plotted the average loss of the individual ERM solutions. Results are averaged over 400 independent runs. The results for the normal distribution appear in Figure 1. |
| Researcher Affiliation | Academia | ¹Technion - Israel Institute of Technology, Haifa, Israel; ²Weizmann Institute of Science, Rehovot, Israel; ³Toyota Technological Institute at Chicago, Illinois, USA. |
| Pseudocode | Yes | Algorithm 1 SHIFT-AND-INVERT POWER METHOD, Algorithm 2 Distributed First-Order Oracle for Fλ,w(y) |
| Open Source Code | No | The paper does not provide any explicit statements about the release of its source code, nor does it include a link to a code repository or mention code in supplementary materials. |
| Open Datasets | No | We generated synthetic datasets using two distributions. For both distributions we used the covariance matrix X = UΛU⊤ with U being a random d×d orthonormal matrix and Λ diagonal, satisfying Λ(1,1) = 1, Λ(2,2) = 0.8, ∀j ≥ 3: Λ(j,j) = 0.9·Λ(j−1,j−1), i.e., δ = 0.2. One dataset was generated according to the normal distribution N(0, X), and for the second dataset we generated samples by taking x = √3·X^(1/2)·y where y ~ U[−1,1]^d. In both cases we set d = 300. |
| Dataset Splits | No | The paper states that synthetic datasets were generated and experiments were run for various per-machine sample sizes (n) and averaged over 400 independent runs. However, it does not specify any training, validation, or test dataset splits in terms of percentages, sample counts, or predefined splits. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments, such as CPU or GPU models, or memory specifications. |
| Software Dependencies | No | The paper does not mention any specific software dependencies or their version numbers that would be required to replicate the experiments. |
| Experiment Setup | Yes | We generated synthetic datasets using two distributions... In both cases we set d = 300. For both datasets we fixed the number of machines to m = 25. We tested the estimation error... vs. the per-machine sample size n... Results are averaged over 400 independent runs. |
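The synthetic setup quoted above (geometrically decaying spectrum with eigengap δ = 0.2, m machines each computing a local leading eigenvector, then single-round aggregation by sign-fixing or by averaging projection matrices) can be sketched as follows. This is a minimal reconstruction assuming NumPy, not the authors' code; the per-machine sample size `n = 200` and the random seed are illustrative choices, and only the normal-distribution dataset is generated (the paper's second dataset uses x = √3·X^(1/2)·y with y ~ U[−1,1]^d).

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 300, 25, 200  # dimension, machines, per-machine samples (n is illustrative)

# Covariance X = U diag(lam) U^T with lam[0]=1, lam[1]=0.8,
# lam[j] = 0.9 * lam[j-1] for j >= 2, i.e. eigengap delta = 0.2.
lam = np.empty(d)
lam[0], lam[1] = 1.0, 0.8
for j in range(2, d):
    lam[j] = 0.9 * lam[j - 1]
U, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthonormal basis
v1 = U[:, 0]                                       # true leading eigenvector
sqrt_cov = U @ np.diag(np.sqrt(lam)) @ U.T         # symmetric square root of X

def leading_eigvec(M):
    """Leading eigenvector of a symmetric matrix (eigh sorts ascending)."""
    _, V = np.linalg.eigh(M)
    return V[:, -1]

# Each machine draws n normal samples and computes its local ERM solution:
# the leading eigenvector of its empirical covariance.
vs = []
for _ in range(m):
    samples = rng.standard_normal((n, d)) @ sqrt_cov   # rows ~ N(0, X)
    vs.append(leading_eigvec(samples.T @ samples / n))

# Aggregation 1: sign-fixing -- align every local estimate with a reference
# direction before averaging, resolving the sign ambiguity of eigenvectors.
ref = vs[0]
w_sign = sum(np.sign(v @ ref) * v for v in vs)
w_sign /= np.linalg.norm(w_sign)

# Aggregation 2: average the rank-one projection matrices v v^T (which are
# sign-invariant) and take the leading eigenvector of the average.
P_bar = sum(np.outer(v, v) for v in vs) / m
w_proj = leading_eigvec(P_bar)

def estimation_error(w):
    return 1.0 - (w @ v1) ** 2  # the paper's error measure, 1 - (w^T v1)^2

print(estimation_error(w_sign), estimation_error(w_proj))
```

Averaging projection matrices sidesteps the sign ambiguity entirely, since v v⊤ = (−v)(−v)⊤; the paper notes this is what lets both aggregations circumvent the lower bound of Theorem 3. The paper's plots repeat this over a range of n and average the errors over 400 independent runs.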