A Convergence Theory for Deep Learning via Over-Parameterization

Authors: Zeyuan Allen-Zhu, Yuanzhi Li, Zhao Song

ICML 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Figure 1: Landscapes of the CIFAR10 image-classification training objective F(W) near points W = W_t on the SGD training trajectory. The x and y axes represent the gradient direction ∇F(W_t) and the most negatively curved direction of the Hessian after smoothing (approximately found by Oja's method (Allen-Zhu & Li, 2017; 2018)). The z axis represents the objective value. Observation. As far as minimizing the objective is concerned, the (negative) gradient direction sufficiently decreases the training objective. This is consistent with our main findings, Theorems 3 and 4. Using second-order information gives little help. Remark 2. The task is CIFAR10 (for CIFAR100 or CIFAR10 with noisy label, see Figures 2 through 7 in the appendix). Remark 4. The six plots correspond to epochs 5, 40, 90, 120, 130 and 160. We start with learning rate 0.1, and decrease it to 0.01 at epoch 81, and to 0.001 at epoch 122. SGD with momentum 0.9 is used. The training code is unchanged from (Yang, 2018) and we only write new code for plotting such landscapes. (A landscape-plotting sketch along these two directions follows the table.)
Researcher Affiliation | Collaboration | 1 Microsoft Research AI, 2 Stanford University, 3 Princeton University, 4 UT-Austin, 5 University of Washington, 6 Harvard University.
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code | No | The paper refers to an external source, 'The training code is unchanged from (Yang, 2018)', but does not provide a link to its own open-source code for the methodology described.
Open Datasets | Yes | Remark 2. The task is CIFAR10 (for CIFAR100 or CIFAR10 with noisy label, see Figures 2 through 7 in the appendix).
Dataset Splits | No | The paper does not provide specific details on training, validation, and test dataset splits such as percentages or sample counts.
Hardware Specification | No | The paper does not provide specific details about the hardware used for the experiments, such as GPU or CPU models.
Software Dependencies | No | The paper mentions 'PyTorch' in Figure 1's caption, but it does not specify the version number of PyTorch or any other software dependencies.
Experiment Setup | Yes | We start with learning rate 0.1, and decrease it to 0.01 at epoch 81, and to 0.001 at epoch 122. SGD with momentum 0.9 is used. The training code is unchanged from (Yang, 2018) and we only write new code for plotting such landscapes. (A sketch of this optimizer and learning-rate schedule follows the table.)
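To make the quoted Figure 1 procedure concrete, here is a minimal PyTorch sketch of plotting the objective F(W) along the gradient direction and an estimated negative-curvature direction. It is not the authors' code: the model, data, grid ranges, and the crude Hessian-vector-product power iteration (standing in for Oja's method and the Hessian smoothing mentioned in the caption) are all placeholder assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder model and batch; the paper's experiment uses a CIFAR10 classifier instead.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
x, y = torch.randn(256, 32), torch.randint(0, 10, (256,))
loss_fn = nn.CrossEntropyLoss()
params = list(model.parameters())

def objective():
    # F(W): training objective at the current parameter values
    return loss_fn(model(x), y)

def flat(tensors):
    return torch.cat([t.reshape(-1) for t in tensors])

# Direction 1: the normalized gradient of F at the current point W_t.
loss = objective()
grads = torch.autograd.grad(loss, params, create_graph=True)
g = flat(grads)
d1 = (g / g.norm()).detach()

# Direction 2: crude power iteration on Hessian-vector products to approximate the
# most negatively curved direction (the paper instead uses Oja's method on a
# smoothed Hessian; this loop is only a stand-in for that step).
v = torch.randn_like(d1)
v = v / v.norm()
shift = 10.0  # iterate on (shift*I - H) so the most negative eigen-direction dominates
for _ in range(30):
    hv = flat(torch.autograd.grad(g @ v, params, retain_graph=True))
    v = shift * v - hv
    v = v / v.norm()
d2 = v.detach()

def split_like(vec, ref):
    # Reshape a flat vector back into tensors shaped like the parameters.
    out, i = [], 0
    for p in ref:
        out.append(vec[i:i + p.numel()].view_as(p))
        i += p.numel()
    return out

# Evaluate F(W_t + a*d1 + b*d2) on a small grid; these values are the z axis of the plot.
base = [p.detach().clone() for p in params]
alphas = torch.linspace(-0.5, 0.5, 11)
with torch.no_grad():
    for a in alphas:
        row = []
        for b in alphas:
            for p, p0, s in zip(params, base, split_like(a * d1 + b * d2, params)):
                p.copy_(p0 + s)
            row.append(objective().item())
        print(" ".join("%.3f" % val for val in row))
    for p, p0 in zip(params, base):
        p.copy_(p0)  # restore the original parameters W_t
```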
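The experiment-setup row reports only the optimizer and schedule, so the following is a minimal, hypothetical PyTorch sketch of that recipe, not the authors' training code (which follows (Yang, 2018)). The model and the empty data loader are placeholders; only the SGD settings and milestones come from the quoted text.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import MultiStepLR

# Placeholder model; the actual run trains a CIFAR10 network from (Yang, 2018).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
loss_fn = nn.CrossEntropyLoss()

# Reported recipe: SGD with momentum 0.9, learning rate 0.1,
# decayed by 10x at epoch 81 and again at epoch 122.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = MultiStepLR(optimizer, milestones=[81, 122], gamma=0.1)

train_loader = []  # placeholder; a torchvision CIFAR10 DataLoader in practice

for epoch in range(160):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()  # advances the learning-rate schedule once per epoch
```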