On the Global Convergence of Training Deep Linear ResNets

Authors: Difan Zou, Philip M. Long, Quanquan Gu

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We study the convergence of gradient descent (GD) and stochastic gradient descent (SGD) for training L-hidden-layer linear residual networks (ResNets). We prove that for training deep residual networks with certain linear transformations at input and output layers, which are fixed throughout training, both GD and SGD with zero initialization on all hidden weights can converge to the global minimum of the training loss. Moreover, when specializing to appropriate Gaussian random linear transformations, GD and SGD provably optimize wide enough deep linear ResNets. Compared with the global convergence result of GD for training standard deep linear networks (Du & Hu, 2019), our condition on the neural network width is sharper by a factor of O(κL), where κ denotes the condition number of the covariance matrix of the training data. We further propose modified identity input and output transformations, and show that a (d + k)-wide neural network is sufficient to guarantee the global convergence of GD/SGD, where d, k are the input and output dimensions respectively. ... In this section, we conduct various experiments to verify our theory on synthetic data, including i) comparison between different input and output transformations and ii) comparison between training deep linear ResNets and standard linear networks.
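The model described above is a deep linear ResNet: a fixed input transformation A, L residual blocks of the form (I + W_l), and a fixed output transformation B. Below is a minimal NumPy sketch of the forward map and training objective; the squared-error loss and its 1/2 scaling, as well as the exact shapes of A and B, are assumptions made for illustration rather than details quoted from the paper.

```python
import numpy as np

def forward(A, W_list, B, X):
    """Deep linear ResNet forward map: f(X) = B (I + W_L) ... (I + W_1) A X.

    A      : (m, d) fixed input transformation
    W_list : L hidden weight matrices, each (m, m)
    B      : (k, m) fixed output transformation
    X      : (d, n) inputs, one example per column
    """
    H = A @ X
    for W in W_list:
        H = H + W @ H              # residual block (I + W_l) applied to H
    return B @ H

def training_loss(A, W_list, B, X, Y):
    # Squared-error objective; the 1/2 scaling is an assumption for illustration.
    R = forward(A, W_list, B, X) - Y
    return 0.5 * np.sum(R ** 2)
```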
Researcher Affiliation | Collaboration | Difan Zou, Department of Computer Science, University of California, Los Angeles (knowzou@cs.ucla.edu); Philip M. Long, Google (plong@google.com); Quanquan Gu, Department of Computer Science, University of California, Los Angeles (qgu@cs.ucla.edu)
Pseudocode | Yes | Algorithm 1: (Stochastic) Gradient Descent with zero initialization
1: input: training data {(x_i, y_i)}_{i ∈ [n]}, step size η, total number of iterations T, minibatch size B, input and output weight matrices A and B.
2: initialization: for all l ∈ [L], each entry of the weight matrix W_l^(0) is initialized as 0.
Gradient Descent
3: for t = 0, ..., T - 1 do
4:   W_l^(t+1) = W_l^(t) - η ∇_{W_l} L(W^(t)) for all l ∈ [L]
5: end for
6: output: W^(T)
Stochastic Gradient Descent
7: for t = 0, ..., T - 1 do
8:   uniformly sample a subset B^(t) of size B from the training data without replacement.
9:   for all l ∈ [L], compute the stochastic gradient G_l^(t) = (n/B) Σ_{i ∈ B^(t)} ∇_{W_l} ℓ(W^(t); x_i, y_i)
10:  for all l ∈ [L], W_l^(t+1) = W_l^(t) - η G_l^(t)
11: end for
12: output: {W^(t)}_{t = 0, ..., T}
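The following Python sketch mirrors Algorithm 1 under the forward map sketched above, with gradients of the squared-error loss computed by a standard backpropagation pass. The function names (resnet_grads, gd, sgd) are illustrative, and the n/B rescaling of the minibatch gradient is an assumption chosen to make it an unbiased estimate of the full-sample gradient.

```python
import numpy as np

def resnet_grads(A, W_list, B, X, Y):
    """Gradients of 0.5 * ||B (I + W_L) ... (I + W_1) A X - Y||_F^2 with respect
    to each hidden weight W_l, computed by a standard backpropagation pass."""
    H = [A @ X]                                  # H[l] after l residual blocks
    for W in W_list:
        H.append(H[-1] + W @ H[-1])              # (I + W_l) H[l-1]
    delta = B.T @ (B @ H[-1] - Y)                # gradient w.r.t. the last hidden state
    grads = [None] * len(W_list)
    for l in range(len(W_list) - 1, -1, -1):
        grads[l] = delta @ H[l].T                # gradient w.r.t. W_l
        delta = delta + W_list[l].T @ delta      # propagate through (I + W_l)^T
    return grads

def gd(A, B, X, Y, L, eta, T):
    """Gradient descent branch of Algorithm 1 with zero initialization."""
    m = A.shape[0]
    W_list = [np.zeros((m, m)) for _ in range(L)]        # all hidden weights start at 0
    for _ in range(T):
        grads = resnet_grads(A, W_list, B, X, Y)
        W_list = [W - eta * G for W, G in zip(W_list, grads)]
    return W_list

def sgd(A, B, X, Y, L, eta, T, batch_size, seed=0):
    """SGD branch of Algorithm 1 with zero initialization. The n/B rescaling
    (an assumption) makes the minibatch gradient an unbiased estimate of the
    full-sample gradient."""
    rng = np.random.default_rng(seed)
    m, n = A.shape[0], X.shape[1]
    W_list = [np.zeros((m, m)) for _ in range(L)]
    for _ in range(T):
        idx = rng.choice(n, size=batch_size, replace=False)   # sample B^(t) without replacement
        grads = resnet_grads(A, W_list, B, X[:, idx], Y[:, idx])
        W_list = [W - eta * (n / batch_size) * G for W, G in zip(W_list, grads)]
    return W_list
```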
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described.
Open Datasets | No | In this section, we conduct various experiments to verify our theory on synthetic data, including i) comparison between different input and output transformations and ii) comparison between training deep linear ResNets and standard linear networks. Specifically, we randomly generate X ∈ R^{10×1000} from a standard normal distribution and set Y = X + 0.1·E, where each entry in E is independently generated from a standard normal distribution.
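The synthetic data described in the quoted excerpt can be generated as follows; the shapes (10 × 1000) and the 0.1 noise scale come from the text, while the random seed is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)         # arbitrary seed; the paper does not specify one
X = rng.standard_normal((10, 1000))    # X in R^{10 x 1000}, standard normal entries
E = rng.standard_normal((10, 1000))    # independent standard normal noise
Y = X + 0.1 * E                        # Y = X + 0.1 * E
```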
Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment.
Experiment Setup | Yes | Considering 10-hidden-layer linear ResNets, we apply three input and output transformations: identity transformations, modified identity transformations, and random transformations. We evaluate the convergence performance of these three choices of transformations and report the results in Figures 1(a)-1(b), where we consider the two cases m = 40 and m = 200. ... Specifically, we adopt the same training data generated in Section 5.1 and consider training an L-hidden-layer neural network with fixed width m. ... For training linear ResNets, we found that the convergence performances are quite similar for different L, thus we only plot the convergence result for the largest one (e.g., L = 20 for m = 40 and L = 100 for m = 200). ... For all l ∈ [L], each entry of the weight matrix W_l^(0) is initialized as 0.
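A hedged sketch of the three choices of fixed transformations compared in the experiments is given below. Only the random (Gaussian) case is described explicitly in the quoted text; the zero-padded identity and the non-overlapping "modified identity" placements, as well as the 1/sqrt(m) scale of the Gaussian entries, are assumptions made for illustration and may differ from the paper's exact constructions.

```python
import numpy as np

def make_transformations(kind, m, d, k, rng=None):
    """Construct fixed input/output transformations A (m x d) and B (k x m).

    kind = "identity"          : I_d / I_k padded with zeros (placement is an
                                 assumption for the case m > d, k).
    kind = "modified_identity" : identity blocks placed so the input and output
                                 parts of the hidden layer do not overlap;
                                 requires m >= d + k (placement is an assumption).
    kind = "random"            : i.i.d. Gaussian entries; the 1/sqrt(m) scale is
                                 an assumption, not the paper's exact choice.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    if kind == "identity":
        A = np.zeros((m, d)); A[:d, :] = np.eye(d)      # I_d in the top rows of A
        B = np.zeros((k, m)); B[:, :k] = np.eye(k)      # I_k in the first columns of B
    elif kind == "modified_identity":
        assert m >= d + k, "modified identity needs width at least d + k"
        A = np.zeros((m, d)); A[:d, :] = np.eye(d)      # I_d in the top rows of A
        B = np.zeros((k, m)); B[:, m - k:] = np.eye(k)  # I_k in the last columns of B
    elif kind == "random":
        A = rng.standard_normal((m, d)) / np.sqrt(m)
        B = rng.standard_normal((k, m)) / np.sqrt(m)
    else:
        raise ValueError(f"unknown transformation kind: {kind}")
    return A, B

# Example: the three settings compared in Figure 1, for width m = 40 and d = k = 10.
for kind in ("identity", "modified_identity", "random"):
    A, B = make_transformations(kind, m=40, d=10, k=10)
```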