Fast Mixing of Stochastic Gradient Descent with Normalization and Weight Decay
Authors: Zhiyuan Li, Tianhao Wang, Dingli Yu
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Though our convergence result is asymptotic, we verify in simplified settings that the phenomena predicted by our theory happen with LR and WD factor λ of practical scale (see Section 6 for details of the experiments). We also show empirically that the mixing process exists in practical settings, and is beneficial for generalization. |
| Researcher Affiliation | Academia | Zhiyuan Li (Princeton University, zhiyuanli@cs.princeton.edu); Tianhao Wang (Yale University, tianhao.wang@yale.edu); Dingli Yu (Princeton University, dingliy@cs.princeton.edu) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We include the code and datasets along with instructions needed to reproduce the main experimental results in the supplemental material. |
| Open Datasets | Yes | Beyond the toy example, we further study the limiting diffusion of PreResNet on CIFAR-10 [26]. (Reference [26]: Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images.) |
| Dataset Splits | No | Figure 1a shows the train and test accuracy of a scale-invariant PreResNet trained by SGD+WD on CIFAR-10 with standard data augmentation. The paper mentions training and testing, but no explicit validation split details are provided. |
| Hardware Specification | No | No specific hardware details (like GPU/CPU models or types of resources) are mentioned in the paper's text. The checklist for experimental details explicitly states 'No' for including compute resources. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers. |
| Experiment Setup | Yes | In our experiments, we choose D = 10, σ = 0.3, the WD factor λ = 0.05, and LR ∈ {10⁻², 10⁻³, 10⁻⁴} (Section 6.1). We train a 32-layer PreResNet [27] with initial LR = 0.8 and WD factor λ = 5 × 10⁻⁴ (Section 6.2). A minimal configuration sketch based on these values follows the table. |
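
The quoted hyperparameters are enough to reconstruct the optimizer configuration for Section 6.2. Below is a minimal PyTorch sketch, not the authors' released code: only the initial LR = 0.8, the WD factor λ = 5 × 10⁻⁴, CIFAR-10, and standard data augmentation come from the quoted text. The batch size, normalization statistics, default momentum of 0, and the ResNet-18 stand-in for the scale-invariant PreResNet-32 are assumptions made for illustration.

```python
# Hedged sketch of the SGD + weight-decay setup quoted above (Section 6.2).
# Assumptions not taken from the paper: batch size 128, CIFAR-10 mean/std
# constants, momentum left at 0, and ResNet-18 standing in for PreResNet-32.
import torch
import torchvision
import torchvision.transforms as T

# Standard CIFAR-10 augmentation: random crop with padding + horizontal flip.
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=train_transform)
train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=128, shuffle=True, num_workers=4)

# Stand-in model: the paper trains a scale-invariant 32-layer PreResNet,
# which is not in torchvision; any CIFAR classifier slots in here.
model = torchvision.models.resnet18(num_classes=10)

# Optimizer as quoted: initial LR = 0.8, weight-decay factor λ = 5e-4.
optimizer = torch.optim.SGD(model.parameters(), lr=0.8, weight_decay=5e-4)
criterion = torch.nn.CrossEntropyLoss()

# One pass over the training set to show how the pieces fit together.
model.train()
for images, labels in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```

The LR schedule after the initial value of 0.8 is not specified in the quoted text, so the sketch keeps the learning rate fixed; the paper's supplemental code should be consulted for the exact schedule and model definition.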