Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
On ADMM in Deep Learning: Convergence and Saturation-Avoidance
Authors: Jinshan Zeng, Shao-Bo Lin, Yuan Yao, Ding-Xuan Zhou
JMLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we develop an alternating direction method of multipliers (ADMM) for deep neural network training with sigmoid-type activation functions (called the sigmoid-ADMM pair), mainly motivated by the gradient-free nature of ADMM in avoiding the saturation of sigmoid-type activations and the advantages of deep neural networks with sigmoid-type activations (called deep sigmoid nets) over their rectified linear unit (ReLU) counterparts (called deep ReLU nets) in terms of approximation. In particular, we prove that the approximation capability of deep sigmoid nets is not worse than that of deep ReLU nets by showing that the ReLU activation function can be well approximated by deep sigmoid nets with two hidden layers and finitely many free parameters, but not vice versa. We also establish the global convergence of the proposed ADMM for the nonlinearly constrained formulation of deep sigmoid net training from arbitrary initial points to a Karush-Kuhn-Tucker (KKT) point at a rate of order O(1/k). Besides the sigmoid activation, such a convergence theorem holds for a general class of smooth activations. Compared with the widely used stochastic gradient descent (SGD) algorithm for deep ReLU net training (called the ReLU-SGD pair), the proposed sigmoid-ADMM pair is practically stable with respect to the algorithmic hyperparameters, including the learning rate, initialization schemes and the pre-processing of the input data. Moreover, we find that to approximate and learn simple but important functions, the proposed sigmoid-ADMM pair numerically outperforms the ReLU-SGD pair. |
| Researcher Affiliation | Academia | Jinshan Zeng (EMAIL): School of Computer and Information Engineering, Jiangxi Normal University, Nanchang, China; Liu Bie Ju Centre for Mathematical Sciences, City University of Hong Kong, Hong Kong; Department of Mathematics, Hong Kong University of Science and Technology, Hong Kong. Shao-Bo Lin (EMAIL): Center of Intelligent Decision-Making and Machine Learning, School of Management, Xi'an Jiaotong University, Xi'an, China. Yuan Yao (EMAIL): Department of Mathematics, Hong Kong University of Science and Technology, Hong Kong. Ding-Xuan Zhou (EMAIL): School of Data Science and Department of Mathematics, City University of Hong Kong, Hong Kong. |
| Pseudocode | Yes | Algorithm 1 (ADMM for Deep Sigmoid Nets Training). Samples: X := [x_1, ..., x_n] ∈ R^{d_0 × n}, Y := [y_1, ..., y_n] ∈ R^{d_N × n}. Initialization: ({W_i^0}_{i=1}^N, {V_i^0}_{i=1}^N, {Λ_i^0}_{i=1}^N) set according to (7); V_0^k ≡ X for all k ∈ N. Parameters: λ > 0, β_i > 0, i = 1, ..., N. For k = 1, 2, ...: (Backward Estimation) for i = N down to 1, update W_N^k via (8) and the other W_i^k via (19); (Forward Prediction) for j = 1 to N, update V_j^k (j = 1, ..., N−2) via (20), V_{N−1}^k via (11), and V_N^k via (12); (Updating Multipliers) Λ_i^k = Λ_i^{k−1} + β_i(σ(W_i^k V_{i−1}^k) − V_i^k) for i = 1, ..., N−1, and Λ_N^k = Λ_N^{k−1} + β_N(W_N^k V_{N−1}^k − V_N^k); set k ← k + 1. |
| Open Source Code | Yes | The codes are available at https://github.com/JinshanZeng/ADMM-DeepLearning. |
| Open Datasets | Yes | 6.1 Earthquake intensity dataset: The Earthquake Intensity Database is from https://www.ngdc.noaa.gov/hazard/intintro.shtml. This database contains more than 157,000 reports on over 20,000 earthquakes that affected the United States from the year 1638 to 1985. ... 6.2 Extended Yale B face recognition database: In the extended Yale B (EYB) database, a well-known face recognition database (Lee et al., 2005), there are in total 2432 images for 38 objects under 9 poses and 64 illumination conditions, where for each object there are 64 images. ... 6.3 PTB Diagnostic ECG database: An ECG is a 1D signal which is the result of recording the electrical activity of the heart using an electrode. It is one of the popular tools that cardiologists use to diagnose heart anomalies and diseases. The PTB diagnostic ECG database is available at https://github.com/CVxTz/ECG_Heartbeat_Classification and was preprocessed by (Kachuee et al., 2018). |
| Dataset Splits | Yes | For the Earthquake Intensity Database: We divide the total data set into training and test sets randomly, where the training and test sample sizes are 4173 and 4000, respectively. For the extended Yale B (EYB) database: In our experiments, we randomly divide the 64 images for each object into two equal parts, that is, one half of the images are used for training while the other half are used for testing. For the PTB Diagnostic ECG database: There are 14,552 samples in total with 2 categories. (Train/test counts or percentages are not explicitly stated for PTB; for EYB the split is an explicit 50/50.) |
| Hardware Specification | Yes | All numerical experiments were carried out in a Matlab R2015b environment running Windows 10, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.2 GHz. ... This research made use of the computing resources of the X-GPU cluster supported by the Hong Kong Research Grant Council Collaborative Research Fund: C6021-19EF. |
| Software Dependencies | Yes | All numerical experiments were carried out in a Matlab R2015b environment running Windows 10, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.2 GHz. |
| Experiment Setup | Yes | 5.1 Experimental settings: In all our experiments, we use deep fully connected neural networks with different depths and widths. Throughout the paper, the depth and width of a deep neural network are, respectively, the number of hidden layers and the number of neurons in each hidden layer. For simplicity, we only consider deep neural networks with the same width for all hidden layers. We consider both deep sigmoid nets and deep ReLU nets in the simulation. ... For ADMM, we empirically set the regularization parameter λ = 10^{-6} and the augmented Lagrangian parameters β_i all to 1, while for the SGD methods, we empirically use a step exponential decay (also called geometric decay) learning rate schedule with decay factor 0.95 applied every 10 epochs. For SGDM and Adam, we use the default settings as presented in Table 2. The number of epochs in all experiments is empirically set to 2000. The specific settings of these experiments are presented in Table 2. |
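The "Updating Multipliers" step quoted in the Pseudocode row above can be sketched in a few lines. This is a hedged illustration, not the authors' released Matlab code: all names (`update_multipliers`, the list layout of `W`, `V`, `Lam`) are assumptions for exposition. Each Lagrange multiplier Λ_i is advanced along the residual of its layer constraint, with the sigmoid applied on hidden layers and no activation on the linear output layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def update_multipliers(W, V, Lam, X, beta):
    """One 'Updating Multipliers' pass of Algorithm 1 (illustrative sketch).

    W, V, Lam are per-layer lists of arrays for layers 1..N; X is the input
    matrix V_0; beta is the list of augmented-Lagrangian parameters.
    Hidden layers use the sigmoid constraint sigma(W_i V_{i-1}) = V_i;
    the output layer N uses the linear constraint W_N V_{N-1} = V_N.
    """
    N = len(W)
    V_prev = [X] + V[:-1]  # inputs to each layer: V_0 = X, then V_1..V_{N-1}
    for i in range(N):
        pre = W[i] @ V_prev[i]
        residual = (sigmoid(pre) if i < N - 1 else pre) - V[i]
        Lam[i] = Lam[i] + beta[i] * residual  # dual ascent on the constraint
    return Lam
```

The backward W-sweep and forward V-sweep referenced as updates (8), (11), (12), (19) and (20) in the paper are problem-specific subproblem solves and are omitted here.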
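The learning rate schedule described in the Experiment Setup row (step exponential, or geometric, decay by 0.95 every 10 epochs) can be written as a one-line rule. The function name and the initial rate `lr0` are placeholders, not values from the paper:

```python
def step_decay_lr(lr0, epoch, gamma=0.95, step=10):
    """Learning rate after `epoch` epochs under step exponential decay:
    multiply the initial rate lr0 by gamma once every `step` epochs."""
    return lr0 * gamma ** (epoch // step)
```

For example, with `lr0 = 0.1` the rate stays at 0.1 for epochs 0-9, drops to 0.095 at epoch 10, and so on geometrically.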