Noise and Fluctuation of Finite Learning Rate Stochastic Gradient Descent

Authors: Kangqiao Liu, Liu Ziyin, Masahito Ueda

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In section 5, we verify our theoretical results experimentally. In section 6, we apply our solution to some well-known problems that have previously been investigated in the continuous-time limit. A summary of our results is given in Table 1. ... We first consider the case when w ∈ R is one-dimensional. The loss function is L(w) = (1/2) k w^2 with k = 1. In Figure 2(a), we plot the variance of w after 1000 training steps from 10^4 independent runs. We compare the prediction of Corollary 1 with that of the continuous-time approximation in Mandt et al. (2017). We see that the proposed theory agrees excellently with the experiment, whereas the standard continuous-time approximation fails as λ increases. (A reproduction sketch of this setup follows the table.)
Researcher Affiliation | Academia | 1. Department of Physics, the University of Tokyo, Japan; 2. RIKEN CEMS, Japan; 3. Institute for Physics of Intelligence, the University of Tokyo, Japan. Correspondence to: Kangqiao Liu <kqliu@cat.phys.s.u-tokyo.ac.jp>, Liu Ziyin <zliu@cat.phys.s.u-tokyo.ac.jp>.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | For minibatch noise, we solve a linear regression task with the loss function L(w) = (1/N) sum_{i=1}^{N} (w^T x_i - y_i)^2, where N = 1000 is the number of data points; for the 1d case, the data points x_i are sampled independently from a normal distribution N(0,1); y_i = w* x_i + ϵ_i with a constant but fixed w*, and the ϵ_i are noise terms, also sampled from a normal distribution. (A reproduction sketch of this data-generation setup follows the table.)
Dataset Splits | No | The paper does not provide specific details on how the dataset was split into training, validation, and test sets. It mentions using N = 1000 data points for linear regression but gives no split percentages or counts.
Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers) needed to replicate the experiment.
Experiment Setup | No | The paper describes the experimental conditions (e.g., loss function, noise types, and the N, S, k, and λ values tested) but does not provide specific hyperparameter values for the deep learning experiments (such as learning rate, batch size, or number of epochs) or other training configurations.
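
The excerpt quoted under Research Type describes a 1-d quadratic-loss experiment: the variance of w after 1000 SGD steps, estimated from 10^4 independent runs and compared against a continuous-time prediction. The sketch below is a minimal, hypothetical reproduction of that setup, not the authors' code. The additive Gaussian gradient-noise model, its variance sigma2, the initial value of w, and the learning rates scanned are assumptions made here for illustration; the "discrete" formula printed is simply the stationary variance implied by the update written in the code, not a quotation of the paper's Corollary 1.

# Minimal sketch (assumptions noted above and in comments).
import numpy as np

k = 1.0          # curvature of the quadratic loss L(w) = (1/2) k w^2
sigma2 = 1.0     # assumed variance of the additive gradient noise
steps = 1000     # training steps, as quoted
runs = 10_000    # 10^4 independent runs, as quoted
rng = np.random.default_rng(0)

for lam in (0.1, 0.5, 1.0, 1.5):             # illustrative learning rates
    w = np.ones(runs)                         # one scalar parameter per run
    for _ in range(steps):
        noise = rng.normal(0.0, np.sqrt(sigma2), size=runs)
        w -= lam * (k * w + noise)            # SGD step with noisy gradient
    empirical = w.var()
    # Stationary variance of the discrete-time recursion above (elementary
    # fixed-point algebra), versus the value lam*sigma2/(2k) obtained from
    # the usual continuous-time (SDE) approximation.
    discrete = lam * sigma2 / (k * (2.0 - lam * k))
    continuous = lam * sigma2 / (2.0 * k)
    print(f"lambda={lam:.2f}  empirical={empirical:.4f}  "
          f"discrete={discrete:.4f}  continuous-time={continuous:.4f}")

As in the quoted excerpt, the two predictions are close for small λ and separate as λ grows, which is the regime where a finite-learning-rate (discrete-time) treatment matters.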
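
The Open Datasets row quotes the synthetic linear-regression task used to study minibatch noise. The sketch below generates data of that form and runs minibatch SGD on it; the true parameter w_star, the label-noise scale, the batch size S, the learning rate, and the number of steps are illustrative assumptions, since the quoted excerpt does not fix them.

# Minimal sketch of the 1-d linear-regression setup (values assumed, not from the paper).
import numpy as np

rng = np.random.default_rng(0)
N = 1000                                   # number of data points, as quoted
w_star = 2.0                               # assumed ground-truth parameter
x = rng.normal(0.0, 1.0, size=N)           # x_i ~ N(0, 1), as quoted
eps = rng.normal(0.0, 0.1, size=N)         # label noise (scale assumed)
y = w_star * x + eps                       # y_i = w* x_i + eps_i

def loss(w):
    """Full-batch loss L(w) = (1/N) * sum_i (w * x_i - y_i)^2."""
    return np.mean((w * x - y) ** 2)

S, lam, steps = 10, 0.1, 1000              # assumed batch size / learning rate / steps
w = 0.0
for _ in range(steps):
    idx = rng.choice(N, size=S, replace=False)       # draw a minibatch of size S
    grad = 2.0 * np.mean((w * x[idx] - y[idx]) * x[idx])
    w -= lam * grad
print(f"final w = {w:.3f}, loss = {loss(w):.4f}")

Because the data are generated on the fly from standard normal distributions, there is no external dataset to release, which is consistent with the assessment that no train/validation/test split is reported.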