Fast Mixing of Stochastic Gradient Descent with Normalization and Weight Decay

Authors: Zhiyuan Li, Tianhao Wang, Dingli Yu

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Though our convergence result is asymptotic, we verify in simplified settings that the phenomena predicted by our theory happens with LR and WD factor λ of practical scale (see Section 6 for details of experiments). We also show empirically that the mixing process exists in practical settings, and is beneficial for generalization."
Researcher Affiliation | Academia | Zhiyuan Li (Princeton University, zhiyuanli@cs.princeton.edu); Tianhao Wang (Yale University, tianhao.wang@yale.edu); Dingli Yu (Princeton University, dingliy@cs.princeton.edu)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "We include the code and datasets along with instructions needed to reproduce the main experimental results in the supplemental material."
Open Datasets | Yes | "Beyond the toy example, we further study the limiting diffusion of PreResNet on CIFAR-10 [26]." (Reference [26]: Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images.)
Dataset Splits | No | "Figure 1a shows the train and test accuracy of scale-invariant PreResNet trained by SGD+WD on CIFAR-10 with standard data augmentation." The paper mentions training and testing, but provides no explicit validation-split details.
Hardware Specification | No | No specific hardware details (such as GPU/CPU models or other compute resources) are mentioned in the paper's text. The experimental-details checklist explicitly answers 'No' for reporting compute resources.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers.
Experiment Setup | Yes | "In our experiments, we choose D = 10, σ = 0.3, the WD factor λ = 0.05, and LR ∈ {10⁻², 10⁻³, 10⁻⁴}" (Section 6.1). "We train a 32-layer PreResNet [27] with initial LR = 0.8 and WD factor λ = 5×10⁻⁴" (Section 6.2). A hedged sketch of this configuration appears below the table.
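
For concreteness, here is a minimal PyTorch sketch of the Section 6.2 configuration quoted above: SGD with weight decay on CIFAR-10 with standard augmentation, initial LR = 0.8 and WD factor λ = 5×10⁻⁴. This is a sketch under assumptions, not the authors' supplemental code: the batch size (128) and the torchvision resnet18 stand-in (in place of the paper's 32-layer scale-invariant PreResNet) are illustrative choices.

```python
# Minimal sketch of the quoted Section 6.2 setup; not the authors' released code.
import torch
import torchvision
import torchvision.transforms as T

# Standard CIFAR-10 augmentation: random crop with padding and horizontal flip.
transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=128, shuffle=True)  # batch size is an assumption

# Stand-in architecture: the paper trains a 32-layer scale-invariant
# PreResNet [27]; torchvision has no PreResNet, so resnet18 keeps this runnable.
model = torchvision.models.resnet18(num_classes=10)

# Quoted hyperparameters: initial LR = 0.8, WD factor λ = 5e-4. PyTorch's
# weight_decay adds λ·w to each gradient, i.e. an L2 penalty with factor λ.
optimizer = torch.optim.SGD(model.parameters(), lr=0.8, weight_decay=5e-4)
criterion = torch.nn.CrossEntropyLoss()

# One SGD+WD step per mini-batch (single step shown for illustration).
model.train()
for inputs, targets in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
    break
```

The combination matters for the paper's thesis: with normalization layers throughout the network, the trainable weights are effectively scale-invariant, which is the regime in which the paper's SGD+WD fast-mixing analysis applies.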