Proximal Stochastic Recursive Momentum Methods for Nonconvex Composite Decentralized Optimization
Authors: Gabriel Mancino-Ball, Shengnan Miao, Yangyang Xu, Jie Chen
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments. In this section, we empirically validate the convergence theory of DEEPSTORM and demonstrate its effectiveness in comparison with representative decentralized methods. We compare all versions of DEEPSTORM with DSGT (Lu et al. 2019; Zhang and You 2020; Koloskova, Lin, and Stich 2021; Xin, Khan, and Kar 2021b), SPPDM (Wang et al. 2021), and ProxGT-SR-O/E (Xin et al. 2021). All experiments are conducted using the AiMOS supercomputer (https://cci.rpi.edu/aimos) with eight NVIDIA Tesla V100 GPUs in total, with code implemented in PyTorch (v1.6.0) and Open MPI (v3.1.4). Problems. We conduct tests on three classification problems. Each local agent i has the objective $\varphi_i(x_i) = \frac{1}{M}\sum_{j=1}^{M} \ell(g(x_i, a_j), b_j) + \lambda \|x_i\|_1$, where $g(x, a)$ is the output of a neural network with parameters $x$ on data $a$, and $\ell$ is the cross-entropy loss between the output and the true label $b$. The data is uniformly randomly split among the agents, each obtaining $M$ training examples. The L1 regularization promotes sparsity of the trained network. The regularization strength $\lambda$ is set to 0.0001 following general practice. Data sets and neural networks. The three data sets we experiment with are summarized in Table 2 in Appendix A in (Mancino-Ball et al. 2022). Two of them are tabular data, and we use a standard multi-layer perceptron for $g$ (one hidden layer with 64 units). The other data set contains images; thus, we use a convolutional neural network. Both neural networks use the tanh activation to satisfy the smoothness condition of the objective function. Communication graphs. Each data set is paired with a different communication graph, indicated by, and visualized in, Table 2 in Appendix A in (Mancino-Ball et al. 2022). For the ladder and random graphs, the mixing matrix is set as $W = I - \gamma L$, where $\gamma$ is the reciprocal of the maximum eigenvalue of the combinatorial Laplacian $L$. For the ring graph, the self-weight and neighbor weights are set to $1/3$. Performance metrics. We evaluate on four metrics: training loss, stationarity violation, solution sparsity, and test accuracy. Further, we compare the methods with respect to data passes and algorithm iterations, which reflect the sample complexity and communication complexity, respectively. Note that for each iteration, all methods except SPPDM communicate two variables. For the training loss, stationarity violation, and test accuracy, we evaluate on the average solution $\bar{x}$. The stationarity violation is defined as $\|\bar{x} - \mathrm{prox}_r(\bar{x} - \nabla f(\bar{x}))\|_2^2 + \sum_{i=1}^{N} \|x_i - \bar{x}\|_2^2$, which measures both optimality and consensus. For sparsity, we use the average percentage of non-zeros in each $x_i$ prior to local communication. Protocols. For hyperparameter selection, see Appendix A in (Mancino-Ball et al. 2022). We perform ten runs with different starting points for each dataset. In several runs for the MNIST dataset, DSGT and SPPDM converge to solutions with 1% non-zero entries, but the training loss and test accuracy are not competitive at all. We keep only the five best runs for reporting the (averaged) performance. Results. Table 2 summarizes the results for all performance metrics, using the same number of data passes for all methods once convergence has been observed. For a9a and MiniBooNE, the results are averaged over passes 80 to 100; for MNIST, over passes 180 to 200. Figure 1 compares the methods using the same number of algorithm iterations.
Overall, we see that DEEPSTORM (all variants) generally yields a lower training loss and significantly fewer non-zeros in the solution than the other decentralized algorithms. This observation suggests that DEEPSTORM indeed solves the optimization problem (2) much more efficiently in terms of both data passes and iterations. Moreover, the test accuracy is also highly competitive, demonstrating the practical usefulness of DEEPSTORM. Illustrative Python sketches of the local objective, the mixing matrix, and the stationarity measure are given after this table. |
| Researcher Affiliation | Collaboration | Gabriel Mancino-Ball (1), Shengnan Miao (1), Yangyang Xu (1), Jie Chen (2); (1) Department of Mathematical Sciences, Rensselaer Polytechnic Institute, Troy, NY 12180; (2) MIT-IBM Watson AI Lab, IBM Research, Cambridge, MA 02142; mancig@rpi.edu, snmiao236@gmail.com, xuy21@rpi.edu, chenjie@us.ibm.com |
| Pseudocode | Yes | Algorithm 1: DEEPSTORM |
| Open Source Code | Yes | All codes are made available at https://github.com/gmancino/DEEPSTORM. |
| Open Datasets | Yes | Problems. We conduct tests on three classification problems. Data sets and neural networks. The three data sets we experiment with are summarized in Table 2 in Appendix A in (Mancino-Ball et al. 2022). |
| Dataset Splits | No | The paper states 'The data is uniformly randomly split among the agents' and mentions 'For hyperparameter selection, see Appendix A in (Mancino-Ball et al. 2022)', but does not explicitly provide specific train/validation/test dataset split percentages or sample counts in the main text. |
| Hardware Specification | Yes | All experiments are conducted using the AiMOS supercomputer with eight NVIDIA Tesla V100 GPUs in total |
| Software Dependencies | Yes | with code implemented in PyTorch (v1.6.0) and Open MPI (v3.1.4) |
| Experiment Setup | Yes | The regularization strength λ is set to 0.0001 following general practice. Protocols. For hyperparameter selection, see Appendix A in (Mancino-Ball et al. 2022). |
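The quoted experiment description specifies each agent's local objective as average cross-entropy loss plus an ℓ1 penalty with λ = 0.0001, computed by a one-hidden-layer (64-unit) tanh MLP on tabular data. Below is a minimal PyTorch sketch of that objective; the names `TabularMLP` and `local_objective` are ours (not from the authors' code), and in DEEPSTORM itself the ℓ1 term is handled by a proximal step rather than being added to the differentiated loss.

```python
import torch
import torch.nn as nn

# Hypothetical network matching the description: one hidden layer of 64 units, tanh activation.
class TabularMLP(nn.Module):
    def __init__(self, in_dim: int, num_classes: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.Tanh(), nn.Linear(64, num_classes))

    def forward(self, a):
        return self.net(a)

def local_objective(model: nn.Module, a: torch.Tensor, b: torch.Tensor, lam: float = 1e-4) -> torch.Tensor:
    """(1/M) * sum_j cross_entropy(g(x_i, a_j), b_j) + lam * ||x_i||_1 on a local minibatch."""
    ce = nn.functional.cross_entropy(model(a), b)         # smooth part f_i
    l1 = sum(p.abs().sum() for p in model.parameters())   # nonsmooth regularizer r
    return ce + lam * l1
```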
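For the ladder and random graphs, the mixing matrix is $W = I - \gamma L$ with $\gamma$ the reciprocal of the largest eigenvalue of the combinatorial Laplacian. The following NumPy sketch shows that construction under our reading; the helper name and the 8-node ladder topology are illustrative assumptions, not taken from the paper's code (and the paper uses fixed 1/3 weights for the ring graph instead).

```python
import numpy as np

def laplacian_mixing_matrix(adjacency: np.ndarray) -> np.ndarray:
    """W = I - gamma * L with gamma = 1 / lambda_max(L), L the combinatorial Laplacian."""
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    gamma = 1.0 / np.linalg.eigvalsh(laplacian).max()   # reciprocal of the largest eigenvalue
    return np.eye(adjacency.shape[0]) - gamma * laplacian

# Example: a 2x4 ladder graph on 8 agents (purely an illustrative topology).
edges = [(0, 1), (1, 2), (2, 3), (4, 5), (5, 6), (6, 7), (0, 4), (1, 5), (2, 6), (3, 7)]
A = np.zeros((8, 8))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
W = laplacian_mixing_matrix(A)
assert np.allclose(W.sum(axis=1), 1.0)  # symmetric with rows summing to one
```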
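The stationarity violation quoted above combines a proximal-gradient residual at the average point $\bar{x}$ with a consensus term. Since $r(x) = \lambda\|x\|_1$ here, $\mathrm{prox}_r$ is soft-thresholding. A sketch of the metric under that reading (function names are ours):

```python
import torch

def soft_threshold(z: torch.Tensor, tau: float) -> torch.Tensor:
    """Proximal operator of tau * ||.||_1 (soft-thresholding)."""
    return torch.sign(z) * torch.clamp(z.abs() - tau, min=0.0)

def stationarity_violation(x_bar: torch.Tensor,
                           grad_f_bar: torch.Tensor,
                           local_xs: list,
                           lam: float = 1e-4) -> float:
    """||x_bar - prox_r(x_bar - grad f(x_bar))||_2^2 + sum_i ||x_i - x_bar||_2^2."""
    prox_term = (x_bar - soft_threshold(x_bar - grad_f_bar, lam)).norm() ** 2
    consensus_term = sum((x_i - x_bar).norm() ** 2 for x_i in local_xs)
    return (prox_term + consensus_term).item()
```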