Noise and Fluctuation of Finite Learning Rate Stochastic Gradient Descent
Authors: Kangqiao Liu, Liu Ziyin, Masahito Ueda
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Section 5, we verify our theoretical results experimentally. In Section 6, we apply our solution to some well-known problems that have previously been investigated in the continuous-time limit. A summary of our results is given in Table 1. ... We first consider the case when w ∈ ℝ is one-dimensional. The loss function is L(w) = (1/2) k w^2 with k = 1. In Figure 2(a), we plot the variance of w after 1000 training steps from 10^4 independent runs. We compare the prediction of Corollary 1 with that of the continuous-time approximation in Mandt et al. (2017). We see that the proposed theory agrees excellently with the experiment, whereas the standard continuous-time approximation fails as λ increases. (A simulation sketch of this setup appears after the table.) |
| Researcher Affiliation | Academia | ¹Department of Physics, the University of Tokyo, Japan; ²RIKEN CEMS, Japan; ³Institute for Physics of Intelligence, the University of Tokyo, Japan. Correspondence to: Kangqiao Liu <kqliu@cat.phys.s.u-tokyo.ac.jp>, Liu Ziyin <zliu@cat.phys.s.u-tokyo.ac.jp>. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | For minibatch noise, we solve a linear regression task with the loss function L(w) = (1/N) sum_i^N (w^T x_i − y_i)^2, where N = 1000 is the number of data points; for the 1d case, the data points x_i are sampled independently from a normal distribution N(0, 1); y_i = w* x_i + ϵ_i with a constant but fixed w*, and the ϵ_i are noises, also sampled from a normal distribution. (A data-generation sketch appears after the table.) |
| Dataset Splits | No | The paper does not provide specific details on how the dataset was split into training, validation, and test sets. It mentions using N=1000 data points for linear regression but no split percentages or counts. |
| Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | No | The paper describes the experimental conditions (e.g., the loss function, noise types, and the values of N, S, k, and λ tested) but does not provide a complete training configuration, such as batch sizes, numbers of epochs, or optimizer settings for the deep learning experiments. |
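
The 1D quadratic experiment quoted in the Research Type row is simple enough to reproduce in a few lines. Below is a minimal sketch, assuming additive Gaussian gradient noise of constant variance C (an illustrative stand-in; the paper derives its noise covariance from minibatch sampling). For this toy model the finite-learning-rate prediction is the stationary variance of the resulting AR(1) recursion, which the continuous-time formula of Mandt et al. (2017) recovers only as λ → 0.

```python
import numpy as np

# Minimal sketch of the 1D experiment: SGD on L(w) = (1/2) k w^2 with
# additive Gaussian gradient noise of constant variance C. The values of
# C, lr, and the initialization are illustrative assumptions.
k, C = 1.0, 1.0
lr = 0.5                      # finite learning rate lambda (needs lr * k < 2)
steps, runs = 1000, 10_000    # 1000 training steps, 10^4 independent runs

rng = np.random.default_rng(0)
w = np.ones(runs)
for _ in range(steps):
    grad = k * w + np.sqrt(C) * rng.standard_normal(runs)
    w -= lr * grad

# Stationary variance of the AR(1) recursion w <- (1 - lr*k) w - lr * noise
var_finite_lr = lr * C / (2 * k - lr * k**2)
# Continuous-time prediction (Mandt et al., 2017), recovered as lr -> 0
var_continuous = lr * C / (2 * k)

print(f"empirical variance : {w.var():.4f}")
print(f"finite-lr theory   : {var_finite_lr:.4f}")   # ~0.3333 here
print(f"continuous theory  : {var_continuous:.4f}")  # 0.2500 here
```

Running this shows the empirical variance tracking the finite-learning-rate value rather than the continuous-time one, and the gap widens as lr grows, consistent with the excerpt's claim about Figure 2(a).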
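The synthetic linear regression dataset quoted in the Open Datasets row is fully specified up to the true weight and noise scale. The sketch below generates that data and runs a minibatch SGD loop; w_star, noise_std, the batch size S, and the learning rate are assumed values not given in the quoted excerpt.

```python
import numpy as np

# Sketch of the synthetic linear regression task: N = 1000 points with
# x_i ~ N(0, 1) and y_i = w* x_i + eps_i. The true weight w_star, the
# noise scale, the batch size S, and the learning rate are assumptions.
rng = np.random.default_rng(0)
N, S, lr, steps = 1000, 10, 0.1, 2000
w_star, noise_std = 2.0, 0.5

x = rng.standard_normal(N)
y = w_star * x + noise_std * rng.standard_normal(N)

w = 0.0
for _ in range(steps):
    idx = rng.choice(N, size=S, replace=False)      # draw a minibatch
    # gradient of (1/S) * sum_i (w x_i - y_i)^2 over the minibatch
    grad = 2.0 * np.mean((w * x[idx] - y[idx]) * x[idx])
    w -= lr * grad

print(f"learned w = {w:.3f}  (true w* = {w_star})")  # fluctuates around w*
```

Because the minibatch is resampled every step, w never settles exactly at w*; its stationary fluctuation around w* is precisely the minibatch-noise effect the paper analyzes at finite learning rate.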