Proximal Stochastic Recursive Momentum Methods for Nonconvex Composite Decentralized Optimization
Authors: Gabriel Mancino-Ball, Shengnan Miao, Yangyang Xu, Jie Chen
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments. In this section, we empirically validate the convergence theory of DEEPSTORM and demonstrate its effectiveness in comparison with representative decentralized methods. We compare all versions of DEEPSTORM with DSGT (Lu et al. 2019; Zhang and You 2020; Koloskova, Lin, and Stich 2021; Xin, Khan, and Kar 2021b), SPPDM (Wang et al. 2021), and ProxGT-SR-O/E (Xin et al. 2021). All experiments are conducted using the AiMOS supercomputer (https://cci.rpi.edu/aimos) with eight NVIDIA Tesla V100 GPUs in total, with code implemented in PyTorch (v1.6.0) and Open MPI (v3.1.4). Problems. We conduct tests on three classification problems. Each local agent i has the objective $\varphi_i(x_i) = \frac{1}{M}\sum_{j=1}^{M} \ell(g(x_i, a_j), b_j) + \lambda \|x_i\|_1$, where $g(x, a)$ is the output of a neural network with parameters $x$ on data $a$, and $\ell$ is the cross-entropy loss between the output and the true label $b$. The data is uniformly randomly split among the agents, each obtaining $M$ training examples. The L1 regularization promotes sparsity of the trained network. The regularization strength $\lambda$ is set to 0.0001 following general practice. Data sets and neural networks. The three data sets we experiment with are summarized in Table 2 in Appendix A in (Mancino-Ball et al. 2022). Two of them are tabular data, and we use a standard multi-layer perceptron for $g$ (one hidden layer with 64 units). The other data set contains images; thus, we use a convolutional neural network. Both neural networks use the tanh activation to satisfy the smoothness condition of the objective function. Communication graphs. Each data set is paired with a different communication graph, indicated by, and visualized in, Table 2 in Appendix A in (Mancino-Ball et al. 2022). For the ladder and random graphs, the mixing matrix is set as $W = I - \gamma L$, where $\gamma$ is the reciprocal of the maximum eigenvalue of the combinatorial Laplacian $L$. For the ring graph, the self-weight and neighbor weights are set to $1/3$. Performance metrics. We evaluate on four metrics: training loss, stationarity violation, solution sparsity, and test accuracy. Further, we compare the methods with respect to data passes and algorithm iterations, which reflect the sample complexity and communication complexity, respectively. Note that for each iteration, all methods except SPPDM communicate two variables. For the training loss, stationarity violation, and test accuracy, we evaluate on the average solution $\bar{x}$. The stationarity violation is defined as $\|\bar{x} - \mathrm{prox}_r(\bar{x} - \nabla f(\bar{x}))\|_2^2 + \sum_{i=1}^{N} \|x_i - \bar{x}\|_2^2$, which measures both optimality and consensus. For sparsity, we use the average percentage of non-zeros in each $x_i$ prior to local communication. Protocols. For hyperparameter selection, see Appendix A in (Mancino-Ball et al. 2022). We perform ten runs with different starting points for each dataset. In several runs for the MNIST dataset, DSGT and SPPDM converge to solutions with 1% non-zero entries, but the training loss and test accuracy are not competitive at all. We keep only the five best runs for reporting the (averaged) performance. Results. Table 2 summarizes the results for all performance metrics, using the same number of data passes for all methods once convergence has been observed. For a9a and MiniBooNE, the results are averaged over passes 80 to 100; for MNIST, over passes 180 to 200. Figure 1 compares the methods using the same number of algorithm iterations.
Overall, we see that DEEPSTORM (all variants) generally yields a lower training loss and significantly fewer non-zeros in the solution than the other decentralized algorithms. This observation suggests that DEEPSTORM indeed solves the optimization problem (2) much more efficiently in terms of both data passes and iterations. Moreover, the test accuracy is also highly competitive, demonstrating the practical usefulness of DEEPSTORM. Illustrative Python sketches of the local objective, the mixing matrix, and the stationarity measure are given after this table. |
| Researcher Affiliation | Collaboration | Gabriel Mancino-Ball (1), Shengnan Miao (1), Yangyang Xu (1), Jie Chen (2); (1) Department of Mathematical Sciences, Rensselaer Polytechnic Institute, Troy, NY 12180; (2) MIT-IBM Watson AI Lab, IBM Research, Cambridge, MA 02142; mancig@rpi.edu, snmiao236@gmail.com, xuy21@rpi.edu, chenjie@us.ibm.com |
| Pseudocode | Yes | Algorithm 1: DEEPSTORM |
| Open Source Code | Yes | All codes are made available at https://github.com/gmancino/DEEPSTORM. |
| Open Datasets | Yes | Problems. We conduct tests on three classification problems. Data sets and neural networks. The three data sets we experiment with are summarized in Table 2 in Appendix A in (Mancino-Ball et al. 2022). |
| Dataset Splits | No | The paper states 'The data is uniformly randomly split among the agents' and mentions 'For hyperparameter selection, see Appendix A in (Mancino-Ball et al. 2022)', but does not explicitly provide specific train/validation/test dataset split percentages or sample counts in the main text. |
| Hardware Specification | Yes | All experiments are conducted using the AiMOS supercomputer with eight NVIDIA Tesla V100 GPUs in total |
| Software Dependencies | Yes | with code implemented in PyTorch (v1.6.0) and Open MPI (v3.1.4) |
| Experiment Setup | Yes | The regularization strength λ is set to 0.0001 following general practice. Protocols. For hyperparameter selection, see Appendix A in (Mancino-Ball et al. 2022). |
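The quoted experiment description specifies each agent's local objective as average cross-entropy loss plus an ℓ1 penalty with λ = 0.0001, computed by a one-hidden-layer (64-unit) tanh MLP on tabular data. Below is a minimal PyTorch sketch of that objective; the names `TabularMLP` and `local_objective` are ours (not from the authors' code), and in DEEPSTORM itself the ℓ1 term is handled by a proximal step rather than being added to the differentiated loss.

```python
import torch
import torch.nn as nn

# Hypothetical network matching the description: one hidden layer of 64 units, tanh activation.
class TabularMLP(nn.Module):
    def __init__(self, in_dim: int, num_classes: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.Tanh(), nn.Linear(64, num_classes))

    def forward(self, a):
        return self.net(a)

def local_objective(model: nn.Module, a: torch.Tensor, b: torch.Tensor, lam: float = 1e-4) -> torch.Tensor:
    """(1/M) * sum_j cross_entropy(g(x_i, a_j), b_j) + lam * ||x_i||_1 on a local minibatch."""
    ce = nn.functional.cross_entropy(model(a), b)         # smooth part f_i
    l1 = sum(p.abs().sum() for p in model.parameters())   # nonsmooth regularizer r
    return ce + lam * l1
```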
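For the ladder and random graphs, the mixing matrix is $W = I - \gamma L$ with $\gamma$ the reciprocal of the largest eigenvalue of the combinatorial Laplacian. The following NumPy sketch shows that construction under our reading; the helper name and the 8-node ladder topology are illustrative assumptions, not taken from the paper's code (and the paper uses fixed 1/3 weights for the ring graph instead).

```python
import numpy as np

def laplacian_mixing_matrix(adjacency: np.ndarray) -> np.ndarray:
    """W = I - gamma * L with gamma = 1 / lambda_max(L), L the combinatorial Laplacian."""
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    gamma = 1.0 / np.linalg.eigvalsh(laplacian).max()   # reciprocal of the largest eigenvalue
    return np.eye(adjacency.shape[0]) - gamma * laplacian

# Example: a 2x4 ladder graph on 8 agents (purely an illustrative topology).
edges = [(0, 1), (1, 2), (2, 3), (4, 5), (5, 6), (6, 7), (0, 4), (1, 5), (2, 6), (3, 7)]
A = np.zeros((8, 8))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
W = laplacian_mixing_matrix(A)
assert np.allclose(W.sum(axis=1), 1.0)  # symmetric with rows summing to one
```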
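The stationarity violation quoted above combines a proximal-gradient residual at the average point $\bar{x}$ with a consensus term. Since $r(x) = \lambda\|x\|_1$ here, $\mathrm{prox}_r$ is soft-thresholding. A sketch of the metric under that reading (function names are ours):

```python
import torch

def soft_threshold(z: torch.Tensor, tau: float) -> torch.Tensor:
    """Proximal operator of tau * ||.||_1 (soft-thresholding)."""
    return torch.sign(z) * torch.clamp(z.abs() - tau, min=0.0)

def stationarity_violation(x_bar: torch.Tensor,
                           grad_f_bar: torch.Tensor,
                           local_xs: list,
                           lam: float = 1e-4) -> float:
    """||x_bar - prox_r(x_bar - grad f(x_bar))||_2^2 + sum_i ||x_i - x_bar||_2^2."""
    prox_term = (x_bar - soft_threshold(x_bar - grad_f_bar, lam)).norm() ** 2
    consensus_term = sum((x_i - x_bar).norm() ** 2 for x_i in local_xs)
    return (prox_term + consensus_term).item()
```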