On the Global Convergence of Training Deep Linear ResNets

Authors: Difan Zou, Philip M. Long, Quanquan Gu

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We study the convergence of gradient descent (GD) and stochastic gradient descent (SGD) for training L-hidden-layer linear residual networks (ResNets). We prove that for training deep residual networks with certain linear transformations at input and output layers, which are fixed throughout training, both GD and SGD with zero initialization on all hidden weights can converge to the global minimum of the training loss. Moreover, when specializing to appropriate Gaussian random linear transformations, GD and SGD provably optimize wide enough deep linear ResNets. Compared with the global convergence result of GD for training standard deep linear networks (Du & Hu, 2019), our condition on the neural network width is sharper by a factor of O(κL), where κ denotes the condition number of the covariance matrix of the training data. We further propose modified identity input and output transformations, and show that a (d + k)-wide neural network is sufficient to guarantee the global convergence of GD/SGD, where d, k are the input and output dimensions respectively. ... In this section, we conduct various experiments to verify our theory on synthetic data, including i) comparison between different input and output transformations and ii) comparison between training deep linear ResNets and standard linear networks.
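The model described above is a deep linear ResNet: a fixed input transformation A, L residual blocks of the form (I + W_l), and a fixed output transformation B. Below is a minimal NumPy sketch of the forward map and training objective; the squared-error loss and its 1/2 scaling, as well as the exact shapes of A and B, are assumptions made for illustration rather than details quoted from the paper.

```python
import numpy as np

def forward(A, W_list, B, X):
    """Deep linear ResNet forward map: f(X) = B (I + W_L) ... (I + W_1) A X.

    A      : (m, d) fixed input transformation
    W_list : L hidden weight matrices, each (m, m)
    B      : (k, m) fixed output transformation
    X      : (d, n) inputs, one example per column
    """
    H = A @ X
    for W in W_list:
        H = H + W @ H              # residual block (I + W_l) applied to H
    return B @ H

def training_loss(A, W_list, B, X, Y):
    # Squared-error objective; the 1/2 scaling is an assumption for illustration.
    R = forward(A, W_list, B, X) - Y
    return 0.5 * np.sum(R ** 2)
```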
Researcher Affiliation | Collaboration | Difan Zou, Department of Computer Science, University of California, Los Angeles (knowzou@cs.ucla.edu); Philip M. Long, Google (plong@google.com); Quanquan Gu, Department of Computer Science, University of California, Los Angeles (qgu@cs.ucla.edu)
Pseudocode | Yes | Algorithm 1: (Stochastic) Gradient Descent with zero initialization
1: input: training data {(x_i, y_i)}_{i ∈ [n]}, step size η, total number of iterations T, minibatch size B, input and output weight matrices A and B.
2: initialization: for all l ∈ [L], each entry of the weight matrix W_l^(0) is initialized as 0.
Gradient Descent
3: for t = 0, ..., T - 1 do
4:   W_l^(t+1) = W_l^(t) - η ∇_{W_l} L(W^(t)) for all l ∈ [L]
5: end for
6: output: W^(T)
Stochastic Gradient Descent
7: for t = 0, ..., T - 1 do
8:   uniformly sample a subset B^(t) of size B from the training data without replacement.
9:   for all l ∈ [L], compute the stochastic gradient G_l^(t) = (n/B) Σ_{i ∈ B^(t)} ∇_{W_l} ℓ(W^(t); x_i, y_i)
10:  for all l ∈ [L], W_l^(t+1) = W_l^(t) - η G_l^(t)
11: end for
12: output: {W^(t)}_{t = 0, ..., T}
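The following Python sketch mirrors Algorithm 1 under the forward map sketched above, with gradients of the squared-error loss computed by a standard backpropagation pass. The function names (resnet_grads, gd, sgd) are illustrative, and the n/B rescaling of the minibatch gradient is an assumption chosen to make it an unbiased estimate of the full-sample gradient.

```python
import numpy as np

def resnet_grads(A, W_list, B, X, Y):
    """Gradients of 0.5 * ||B (I + W_L) ... (I + W_1) A X - Y||_F^2 with respect
    to each hidden weight W_l, computed by a standard backpropagation pass."""
    H = [A @ X]                                  # H[l] after l residual blocks
    for W in W_list:
        H.append(H[-1] + W @ H[-1])              # (I + W_l) H[l-1]
    delta = B.T @ (B @ H[-1] - Y)                # gradient w.r.t. the last hidden state
    grads = [None] * len(W_list)
    for l in range(len(W_list) - 1, -1, -1):
        grads[l] = delta @ H[l].T                # gradient w.r.t. W_l
        delta = delta + W_list[l].T @ delta      # propagate through (I + W_l)^T
    return grads

def gd(A, B, X, Y, L, eta, T):
    """Gradient descent branch of Algorithm 1 with zero initialization."""
    m = A.shape[0]
    W_list = [np.zeros((m, m)) for _ in range(L)]        # all hidden weights start at 0
    for _ in range(T):
        grads = resnet_grads(A, W_list, B, X, Y)
        W_list = [W - eta * G for W, G in zip(W_list, grads)]
    return W_list

def sgd(A, B, X, Y, L, eta, T, batch_size, seed=0):
    """SGD branch of Algorithm 1 with zero initialization. The n/B rescaling
    (an assumption) makes the minibatch gradient an unbiased estimate of the
    full-sample gradient."""
    rng = np.random.default_rng(seed)
    m, n = A.shape[0], X.shape[1]
    W_list = [np.zeros((m, m)) for _ in range(L)]
    for _ in range(T):
        idx = rng.choice(n, size=batch_size, replace=False)   # sample B^(t) without replacement
        grads = resnet_grads(A, W_list, B, X[:, idx], Y[:, idx])
        W_list = [W - eta * (n / batch_size) * G for W, G in zip(W_list, grads)]
    return W_list
```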
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described.
Open Datasets | No | In this section, we conduct various experiments to verify our theory on synthetic data, including i) comparison between different input and output transformations and ii) comparison between training deep linear ResNets and standard linear networks. Specifically, we randomly generate X ∈ R^{10×1000} from a standard normal distribution and set Y = X + 0.1·E, where each entry in E is independently generated from a standard normal distribution.
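The synthetic data described in the quoted excerpt can be generated as follows; the shapes (10 × 1000) and the 0.1 noise scale come from the text, while the random seed is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)         # arbitrary seed; the paper does not specify one
X = rng.standard_normal((10, 1000))    # X in R^{10 x 1000}, standard normal entries
E = rng.standard_normal((10, 1000))    # independent standard normal noise
Y = X + 0.1 * E                        # Y = X + 0.1 * E
```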
Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment.
Experiment Setup | Yes | Considering 10-hidden-layer linear ResNets, we apply three input and output transformations: identity transformations, modified identity transformations, and random transformations. We evaluate the convergence performance of these three choices of transformations and report the results in Figures 1(a)-1(b), where we consider the two cases m = 40 and m = 200. ... Specifically, we adopt the same training data generated in Section 5.1 and consider training an L-hidden-layer neural network with fixed width m. ... For training linear ResNets, we found that the convergence performances are quite similar for different L, thus we only plot the convergence result for the largest one (e.g., L = 20 for m = 40 and L = 100 for m = 200). ... For all l ∈ [L], each entry of the weight matrix W_l^(0) is initialized as 0.
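A hedged sketch of the three choices of fixed transformations compared in the experiments is given below. Only the random (Gaussian) case is described explicitly in the quoted text; the zero-padded identity and the non-overlapping "modified identity" placements, as well as the 1/sqrt(m) scale of the Gaussian entries, are assumptions made for illustration and may differ from the paper's exact constructions.

```python
import numpy as np

def make_transformations(kind, m, d, k, rng=None):
    """Construct fixed input/output transformations A (m x d) and B (k x m).

    kind = "identity"          : I_d / I_k padded with zeros (placement is an
                                 assumption for the case m > d, k).
    kind = "modified_identity" : identity blocks placed so the input and output
                                 parts of the hidden layer do not overlap;
                                 requires m >= d + k (placement is an assumption).
    kind = "random"            : i.i.d. Gaussian entries; the 1/sqrt(m) scale is
                                 an assumption, not the paper's exact choice.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    if kind == "identity":
        A = np.zeros((m, d)); A[:d, :] = np.eye(d)      # I_d in the top rows of A
        B = np.zeros((k, m)); B[:, :k] = np.eye(k)      # I_k in the first columns of B
    elif kind == "modified_identity":
        assert m >= d + k, "modified identity needs width at least d + k"
        A = np.zeros((m, d)); A[:d, :] = np.eye(d)      # I_d in the top rows of A
        B = np.zeros((k, m)); B[:, m - k:] = np.eye(k)  # I_k in the last columns of B
    elif kind == "random":
        A = rng.standard_normal((m, d)) / np.sqrt(m)
        B = rng.standard_normal((k, m)) / np.sqrt(m)
    else:
        raise ValueError(f"unknown transformation kind: {kind}")
    return A, B

# Example: the three settings compared in Figure 1, for width m = 40 and d = k = 10.
for kind in ("identity", "modified_identity", "random"):
    A, B = make_transformations(kind, m=40, d=10, k=10)
```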