Convergence Analysis of Two-layer Neural Networks with ReLU Activation
Authors: Yuanzhi Li, Yang Yuan
NeurIPS 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To complement our theory, we are also able to show experimentally that multi-layer networks with this mapping have better performance compared with normal vanilla networks. Our convergence theorem differs from traditional non-convex optimization techniques. We show that SGD converges to optimal in two phases: in phase I, the gradient points in the wrong direction; however, a potential function g gradually decreases. Then in phase II, SGD enters a nice one-point convex region and converges. We also show that the identity mapping is necessary for convergence, as it moves the initial point to a better place for optimization. Experiment verifies our claims. (A hedged architecture sketch based on this description appears after the table.) |
| Researcher Affiliation | Academia | Yuanzhi Li, Computer Science Department, Princeton University, yuanzhil@cs.princeton.edu; Yang Yuan, Computer Science Department, Cornell University, yangyuan@cs.cornell.edu |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Our code can be found in the supplementary materials. |
| Open Datasets | Yes | In this experiment, we choose Cifar-10 as the dataset, and all the networks have 56-layers. |
| Dataset Splits | No | The paper mentions 'training set of size 100,000, and test set of size 10,000' in Section 5.2 but does not explicitly detail a validation split or its size. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) were mentioned for running experiments. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., 'Python 3.8', 'PyTorch 1.9') were found in the paper. |
| Experiment Setup | Yes | We use batch size 200 and step size 0.001. We run ResLink 5 times with random initialization (‖W‖₂ ≤ 0.6 and ‖W‖_F ≤ 5) and plot the curves by taking the average. (A hedged sketch of this setup appears after the table.) |
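
The "Research Type" row quotes the paper's description of a two-layer ReLU network with an identity mapping, where the mapping moves the initial point to a region that is easier to optimize. Below is a minimal PyTorch sketch of such an architecture, assuming the hidden layer computes ReLU((I + W)x) and the output sums the hidden units; the class name `TwoLayerIdentityReLU` and the exact output layer are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class TwoLayerIdentityReLU(nn.Module):
    """Illustrative sketch (not the authors' code): a two-layer ReLU network
    whose first layer carries an identity mapping, i.e. the hidden layer
    computes ReLU((I + W) x) and the output sums the hidden units."""

    def __init__(self, dim: int):
        super().__init__()
        # Starting W at zero keeps (I + W) at the identity, the "better place
        # for optimization" the abstract refers to; this choice is an assumption.
        self.W = nn.Parameter(torch.zeros(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = torch.eye(self.W.shape[0], device=x.device, dtype=x.dtype)
        hidden = torch.relu(x @ (identity + self.W).T)  # ReLU((I + W) x)
        return hidden.sum(dim=-1)                       # scalar output per example
```

Dropping the `identity` term recovers the "vanilla" network that the quoted passage compares against.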
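The "Experiment Setup" row reports batch size 200, step size 0.001, and curves averaged over 5 runs from norm-bounded random initializations. A rough sketch of that protocol is shown below, again assuming PyTorch; `random_init`, `train_once`, and the `model`/`loss_fn`/`data_loader` objects they use are hypothetical stand-ins, since the paper's actual code is provided only in its supplementary materials.

```python
import torch

BATCH_SIZE = 200   # batch size reported in the paper
STEP_SIZE = 0.001  # SGD step size reported in the paper
NUM_RUNS = 5       # curves are averaged over 5 random initializations

def random_init(dim: int) -> torch.Tensor:
    """Hypothetical initializer: rescale a random W so that its spectral and
    Frobenius norms respect the reported bounds (assumed to be upper bounds)."""
    W = torch.randn(dim, dim)
    spectral = torch.linalg.matrix_norm(W, ord=2)
    frobenius = torch.linalg.matrix_norm(W, ord="fro")
    scale = min(0.6 / spectral.item(), 5.0 / frobenius.item(), 1.0)
    return W * scale

def train_once(model, loss_fn, data_loader, num_steps):
    """One SGD run with the reported step size; returns the loss curve.
    data_loader is assumed to yield (x, y) batches of size BATCH_SIZE."""
    optimizer = torch.optim.SGD(model.parameters(), lr=STEP_SIZE)
    curve = []
    for _, (x, y) in zip(range(num_steps), data_loader):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        curve.append(loss.item())
    return torch.tensor(curve)

# The plotted curve would be the average over NUM_RUNS independent runs, e.g.:
# mean_curve = torch.stack([train_once(...) for _ in range(NUM_RUNS)]).mean(dim=0)
```

Averaging over the five runs smooths out the randomness of the individual initializations, which is how the paper reports its curves.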