Global Convergence of Gradient Descent for Deep Linear Residual Networks
Authors: Lei Wu, Qingcan Wang, Chao Ma
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We numerically compare the gradient descent dynamics between the ZAS and the near-identity initialization for multi-dimensional deep linear networks. The comparison clearly shows that the convergence of gradient descent with the near-identity initialization involves a saddle point escape process, while the ZAS initialization never encounters any saddle point during the whole optimization process. We provide an extension of the ZAS initialization to the nonlinear case. Moreover, the numerical experiments justify its superiority compared to the standard initializations. |
| Researcher Affiliation | Academia | Lei Wu, Qingcan Wang, Chao Ma; Program in Applied and Computational Mathematics, Princeton University, Princeton, NJ 08544, USA; {leiwu,qingcanw,chaom}@princeton.edu |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing the source code or a direct link to a code repository for the methodology described. |
| Open Datasets | Yes | The experiments are conducted on Fashion-MNIST [20], where we select 1000 training samples forming the new training set to speed up the computation. |
| Dataset Splits | No | The paper mentions using a 'training set' and refers to testing, but does not provide specific details about any validation set splits (e.g., percentages, sample counts, or explicit references to predefined validation splits). |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments were provided in the paper. |
| Software Dependencies | No | No specific software dependencies with version numbers were mentioned in the paper. |
| Experiment Setup | Yes | We manually tune the optimal learning rate for each L. ... The learning rate η = 0.01 for both initializations. ... Depths L = 100, 200, 2000, 10000 are tested, and the learning rate for each depth is tuned to achieve the fastest convergence. Reported settings: ZAS with L=100 (lr=1e-1), L=200 (lr=1e-1), L=2000 (lr=2e-2), L=10000 (lr=2e-3); Xavier with L=100 (lr=1e-3), L=200 (lr=1e-6). Illustrative code sketches of these settings follow the table. |
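To make the comparison described in the Research Type and Experiment Setup rows concrete, here is a minimal NumPy sketch of full-batch gradient descent on a deep linear residual network f(x) = (I + W_L)···(I + W_1)x under the two initializations. It assumes, based on the scheme's name and description, that ZAS zero-initializes every residual block, while the near-identity baseline uses small Gaussian perturbations; the dimensions, depth, teacher map, and step size are illustrative choices, not the paper's, and this is not the authors' code.

```python
import numpy as np

def train(Ws, X, Y, lr, steps):
    """Full-batch GD on 0.5/n * ||(I+W_L)...(I+W_1) X - Y||_F^2."""
    n = X.shape[1]
    losses = []
    for _ in range(steps):
        # Forward pass, caching activations H_0, ..., H_L for backprop.
        Hs = [X]
        for W in Ws:
            Hs.append(Hs[-1] + W @ Hs[-1])
        R = Hs[-1] - Y
        losses.append(0.5 * np.sum(R ** 2) / n)
        # Backward pass: G holds dLoss/dH_l while walking from block L down to 1.
        G = R / n
        grads = []
        for l in reversed(range(len(Ws))):
            grads.append(G @ Hs[l].T)       # dLoss/dW_{l+1} = G_{l+1} H_l^T
            G = G + Ws[l].T @ G             # pull G back through (I + W_{l+1})
        for W, g in zip(Ws, reversed(grads)):
            W -= lr * g
    return losses

rng = np.random.default_rng(0)
d, n, L = 10, 200, 50                       # illustrative sizes, not the paper's
X = rng.standard_normal((d, n))
A = np.eye(d) + 0.3 * rng.standard_normal((d, d)) / np.sqrt(d)  # well-conditioned linear teacher
Y = A @ X

zas  = [np.zeros((d, d)) for _ in range(L)]                     # ZAS: zero residual blocks (assumed)
near = [1e-2 * rng.standard_normal((d, d)) for _ in range(L)]   # near-identity: small perturbations

for name, Ws in [("ZAS", zas), ("near-identity", near)]:
    curve = train(Ws, X, Y, lr=5e-3, steps=1000)
    print(name, [f"{v:.2e}" for v in curve[::250]], f"final={curve[-1]:.2e}")
```

In this toy setting both initializations converge; the sketch only illustrates the training loop and the two initialization rules, not the saddle-point escape behavior the paper reports for deep multi-dimensional networks.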
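For the nonlinear experiments, the table reports a 1000-sample Fashion-MNIST training subset and depth-specific tuned learning rates. The snippet below collects those reported values into a configuration dictionary and shows one assumed way (via torchvision, which the paper does not specify) to build such a subset; which 1000 samples are selected and the batch size are not stated, so those choices are placeholders.

```python
from torch.utils.data import Subset, DataLoader
from torchvision import datasets, transforms

# Reported depth -> learning-rate settings from the Experiment Setup row above.
TUNED_LR = {
    "ZAS":    {100: 1e-1, 200: 1e-1, 2000: 2e-2, 10000: 2e-3},
    "Xavier": {100: 1e-3, 200: 1e-6},
}
SHARED_LR = 0.01  # rate quoted as 0.01 "for both initializations" in one comparison

# Assumed data pipeline: the paper only states that 1000 training samples are
# selected from Fashion-MNIST to speed up computation.
train_full = datasets.FashionMNIST(root="./data", train=True, download=True,
                                   transform=transforms.ToTensor())
train_small = Subset(train_full, list(range(1000)))  # first 1000 samples: a placeholder choice
loader = DataLoader(train_small, batch_size=100, shuffle=True)  # batch size: a placeholder choice
```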