A Convergence Theory for Deep Learning via Over-Parameterization
Authors: Zeyuan Allen-Zhu, Yuanzhi Li, Zhao Song
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Figure 1: Landscapes of the CIFAR10 image-classification training objective F(W) near points W = W_t on the SGD training trajectory. The x and y axes represent the gradient direction ∇F(W_t) and the most negatively curved direction of the Hessian after smoothing (approximately found by Oja's method (Allen-Zhu & Li, 2017; 2018)). The z axis represents the objective value. Observation. As far as minimizing the objective is concerned, the (negative) gradient direction sufficiently decreases the training objective. This is consistent with our main findings, Theorems 3 and 4. Using second-order information gives little help. Remark 2. The task is CIFAR10 (for CIFAR100 or CIFAR10 with noisy labels, see Figures 2 through 7 in the appendix). Remark 4. The six plots correspond to epochs 5, 40, 90, 120, 130 and 160. We start with learning rate 0.1, and decrease it to 0.01 at epoch 81, and to 0.001 at epoch 122. SGD with momentum 0.9 is used. The training code is unchanged from (Yang, 2018) and we only write new code for plotting such landscapes. |
| Researcher Affiliation | Collaboration | 1 Microsoft Research AI, 2 Stanford University, 3 Princeton University, 4 UT-Austin, 5 University of Washington, 6 Harvard University. |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | No | The paper refers to an external source 'The training code is unchanged from (Yang, 2018)' but does not provide a link to its own open-source code for the methodology described. |
| Open Datasets | Yes | Remark 2. The task is CIFAR10 (for CIFAR100 or CIFAR10 with noisy label, see Figure 2 through 7 in appendix). |
| Dataset Splits | No | The paper does not provide specific details on training, validation, and test dataset splits such as percentages or sample counts. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for the experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions 'PyTorch' in Figure 1's caption, but it does not specify the PyTorch version or any other software dependencies. |
| Experiment Setup | Yes | We start with learning rate 0.1, and decrease it to 0.01 at epoch 81, and to 0.001 at epoch 122. SGD with momentum 0.9 is used. The training code is unchanged from (Yang, 2018) and we only write new code for plotting such landscapes. |
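
For reference, below is a minimal sketch of the quoted training schedule: SGD with momentum 0.9, learning rate 0.1 dropped to 0.01 at epoch 81 and to 0.001 at epoch 122. The model, data, and loss below are placeholders for illustration only; the paper's actual training code is the unmodified code of (Yang, 2018).

```python
# Sketch of the reported schedule: SGD, momentum 0.9,
# lr 0.1 -> 0.01 at epoch 81 -> 0.001 at epoch 122 (assumed MultiStepLR).
import torch
from torch import nn, optim

# Placeholder model and data standing in for the CIFAR10 setup of (Yang, 2018).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
criterion = nn.CrossEntropyLoss()
train_loader = [
    (torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))) for _ in range(4)
]  # hypothetical stand-in for a CIFAR10 DataLoader

optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[81, 122], gamma=0.1)

for epoch in range(160):  # Figure 1 plots epochs up to 160
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
    scheduler.step()  # multiplies lr by 0.1 after epochs 81 and 122
```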