The Implicit Regularization of Dynamical Stability in Stochastic Gradient Descent

Authors: Lei Wu, Weijie J. Su

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Additionally, numerical experiments are provided to support our theoretical findings.
Researcher Affiliation | Academia | School of Mathematical Sciences, Peking University, Beijing, China; Center for Machine Learning Research, Peking University, Beijing, China; Wharton Statistics and Data Science Department, University of Pennsylvania, Philadelphia, USA.
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. It focuses on theoretical analysis and numerical results presented in figures.
Open Source Code | No | The paper does not provide any explicit statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | No | The paper uses synthetic data generated according to specified distributions (e.g., v_i ~ iid Unif(S^{d-1}); see the sampling sketch after the table). It does not use or provide access to a publicly available or open dataset.
Dataset Splits | No | The paper does not explicitly describe train/validation/test dataset splits. While it mentions a "training set" in general terms, specific split proportions or sample counts are not provided.
Hardware Specification | No | The paper does not provide specific details about the hardware used to conduct the experiments, such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., programming languages, libraries, frameworks) needed to reproduce the experiments.
Experiment Setup | Yes | Gradient clipping is automatically switched off after around 4000 iterations. After that, SGD can stably converge to a global minimum without clipping operations. This implies that around the convergent minimum, linear stability should be satisfied and consequently, it is not surprising to observe that Tr(G(θ_t)) ≈ 2/η when θ_t nearly converges. Another interesting observation is that during the whole training process, Tr(G(θ_t)) keeps decreasing, which in turn causes the continued decrease of the path norm. (See the monitoring sketch after the table.)
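
The synthetic inputs mentioned in the Open Datasets row, v_i ~ iid Unif(S^{d-1}), are straightforward to regenerate. Below is a minimal NumPy sketch; the function name `sample_unit_sphere` and the sizes n=100, d=20 are illustrative, not values taken from the paper.

```python
import numpy as np

def sample_unit_sphere(n, d, seed=None):
    """Draw n points i.i.d. uniformly on the unit sphere S^{d-1}.

    Normalizing standard Gaussian vectors yields the uniform distribution
    on the sphere, by rotational invariance of the Gaussian.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal((n, d))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

# Example: 100 inputs in dimension 20 (illustrative sizes).
V = sample_unit_sphere(100, 20, seed=0)
assert np.allclose(np.linalg.norm(V, axis=1), 1.0)
```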
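
The Experiment Setup row quotes the paper's observation that gradient clipping switches off as training stabilizes and that Tr(G(θ_t)) settles near the linear-stability threshold 2/η. The sketch below shows one way to monitor these quantities during SGD. It assumes G(θ) = (1/n) Σ_i ∇f(x_i; θ) ∇f(x_i; θ)^T, the Gram matrix of model-output gradients (consult the paper for the exact definition), and the two-layer ReLU model, width, step size, and clipping threshold are all illustrative rather than the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 100, 20, 50            # samples, input dim, hidden width (illustrative)
eta, clip = 0.2, 1.0             # step size and clipping threshold (illustrative)

# Synthetic data: inputs uniform on the sphere, realizable smooth targets.
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.tanh(X @ rng.standard_normal(d))

# Two-layer ReLU net f(x) = a @ relu(W x); per-sample loss 0.5*(f(x_i) - y_i)^2.
W = rng.standard_normal((m, d)) / np.sqrt(d)
a = rng.standard_normal(m) / np.sqrt(m)

def trace_G():
    """Tr(G) under the assumed definition G = (1/n) sum_i grad f(x_i) grad f(x_i)^T,
    i.e. the mean squared norm of the model-output gradient over the training set."""
    pre = X @ W.T                          # (n, m) pre-activations
    act = np.maximum(pre, 0.0)             # relu(Wx)
    mask = (pre > 0).astype(float)         # relu'(Wx), in {0, 1}
    # ||grad_a f||^2 = ||relu(Wx)||^2;  ||grad_W f||^2 = sum_k a_k^2 relu'_k (||x|| = 1)
    return np.mean(np.sum(act**2, axis=1) + np.sum(a**2 * mask, axis=1))

clipped = 0
for t in range(1, 20001):
    i = rng.integers(n)                      # single-sample SGD
    pre = W @ X[i]
    act = np.maximum(pre, 0.0)
    r = a @ act - y[i]                       # residual on sample i
    g_a = r * act                            # gradient w.r.t. a
    g_W = r * np.outer(a * (pre > 0), X[i])  # gradient w.r.t. W
    norm = np.sqrt(np.sum(g_a**2) + np.sum(g_W**2))
    scale = min(1.0, clip / (norm + 1e-12))  # gradient clipping
    clipped += scale < 1.0
    a -= eta * scale * g_a
    W -= eta * scale * g_W
    if t % 4000 == 0:
        print(f"t={t:5d}  clipped frac={clipped/4000:.2f}  "
              f"Tr(G)={trace_G():7.3f}  2/eta={2/eta:.1f}")
        clipped = 0
```

Whether the monitored Tr(G) actually approaches 2/η depends on the problem and step size; the sketch reproduces only the bookkeeping (clipping activity and the stability diagnostic), not the paper's experiment.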