Strength of Minibatch Noise in SGD

Authors: Liu Ziyin, Kangqiao Liu, Takashi Mori, Masahito Ueda

ICLR 2022

Reproducibility assessment. Each entry below lists the variable, the assessed result, and the LLM response supporting it.
Research Type: Experimental
LLM Response: This work presents the first systematic study of the SGD noise and fluctuations close to a local minimum. We first analyze the SGD noise in linear regression in detail and then derive a general formula for approximating SGD noise in different types of minima. For reference, the relationship of this work to the previous works is shown in Table 1. Appendix A (Experiments): We run a 1d experiment in Figure 4(a) and high-dimensional experiments in Figures 4(b)-(c), where we choose D = 2 for visualization.
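
To make the quoted linear-regression analysis concrete, here is a minimal sketch of how the minibatch noise covariance could be measured empirically at the minimum of a linear regression loss. The synthetic data, batch size, and all variable names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Minimal sketch (not from the paper): estimate the minibatch-gradient noise
# covariance of linear regression at its minimum. Data, batch size, and all
# variable names are illustrative assumptions.
rng = np.random.default_rng(0)
N, d, S = 10000, 2, 32              # samples, dimension, minibatch size
X = rng.normal(size=(N, d))
y = X @ np.ones(d) + 0.5 * rng.normal(size=N)

# Closed-form minimizer of the quadratic loss (1/2N) ||Xw - y||^2
w_star = np.linalg.solve(X.T @ X, X.T @ y)

def minibatch_grad(w, idx):
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

# At w*, the full-batch gradient vanishes, so the covariance of many
# minibatch gradients is the SGD noise covariance at the minimum.
grads = np.stack([minibatch_grad(w_star, rng.choice(N, S, replace=False))
                  for _ in range(5000)])
print("empirical SGD noise covariance at the minimum:\n", np.cov(grads.T))
```

For a quadratic loss, an empirical covariance estimated this way can then be compared against the paper's analytical formulas.
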
Researcher Affiliation: Academia
LLM Response: Liu Ziyin, Kangqiao Liu, Takashi Mori, & Masahito Ueda (The University of Tokyo)
Pseudocode: No
LLM Response: The paper provides mathematical derivations and discusses algorithms (SGD), but it does not include any pseudocode or algorithm blocks.
Open Source Code: No
LLM Response: The paper does not provide any statements about releasing code or links to source code repositories.
Open Datasets: Yes
LLM Response: We train a two-layer tanh neural network on MNIST and plot the variance of its training loss in the first epoch with fixed λ = 0.5. We train a logistic regressor on the MNIST dataset with a large learning rate (of order O(1)).
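
The quoted MNIST setup can be approximated as below; this is a minimal sketch assuming PyTorch/torchvision, interpreting λ = 0.5 as the weight-decay coefficient, and choosing the hidden width, batch size, and learning rate arbitrarily since the excerpt does not fix them.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Minimal sketch of the quoted setup: a two-layer tanh network on MNIST,
# recording the per-batch training loss over the first epoch so its variance
# can be computed. Hidden width, batch size, learning rate, and reading
# lambda = 0.5 as weight decay are all assumptions.
train_set = datasets.MNIST(root="data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=64, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(784, 100), nn.Tanh(),
                      nn.Linear(100, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=0.5)
loss_fn = nn.CrossEntropyLoss()

losses = []
for xb, yb in loader:                  # first (and only) epoch
    opt.zero_grad()
    loss = loss_fn(model(xb), yb)
    loss.backward()
    opt.step()
    losses.append(loss.item())

print("variance of training loss in the first epoch:",
      torch.tensor(losses).var().item())
```

Recording per-batch losses is one natural reading of "variance of its training loss in the first epoch"; an epoch-averaged quantity would hide exactly the minibatch fluctuations at issue.
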
Dataset Splits: No
LLM Response: The paper mentions using the MNIST dataset and training models but does not specify the train/validation/test split percentages or sample counts needed for reproduction. While MNIST has a standard split (60,000 training and 10,000 test images), the paper does not state which portions were used.
Hardware Specification: No
LLM Response: The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments.
Software Dependencies: No
LLM Response: The paper does not provide specific version numbers for any software dependencies or libraries used in the experiments.
Experiment Setup: Yes
LLM Response: We train a two-layer tanh neural network on MNIST and plot the variance of its training loss in the first epoch with fixed λ = 0.5. We train a logistic regressor on the MNIST dataset with a large learning rate (of order O(1)). In Figure 3-Left, we run a 1d experiment with λ = 1, N = 10000, and σ² = 0.25. In Figure 3-Right, we plot a standard case where the optimal regularization strength γ is vanishing. The parameters are set to a = 1, λ = 0.5, S = 1.
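
For the quoted 1d experiment (λ = 1, N = 10000, σ² = 0.25), a minimal sketch follows. The data-generating model, learning rate, batch size, and step count are assumptions; only λ, N, and σ² come from the paper.

```python
import numpy as np

# Minimal sketch of a 1d experiment with the quoted values lambda = 1,
# N = 10000, sigma^2 = 0.25. The data model (E[x^2] = lambda, w* = 0 plus
# label noise), learning rate, batch size, and step count are assumptions.
rng = np.random.default_rng(1)
lam, N, sigma2 = 1.0, 10000, 0.25
eta, S, steps = 0.1, 1, 200000          # learning rate, batch size, SGD steps

x = rng.normal(scale=np.sqrt(lam), size=N)
y = np.sqrt(sigma2) * rng.normal(size=N)

w, trace = 0.0, []
for t in range(steps):
    i = rng.integers(N, size=S)
    grad = np.mean(x[i] * (x[i] * w - y[i]))   # minibatch gradient
    w -= eta * grad
    if t > steps // 2:                          # discard burn-in
        trace.append(w)

print("empirical stationary variance of w:", np.var(trace))
```

Discarding the first half of the trajectory as burn-in isolates the stationary fluctuations around the minimum, which is the regime the noise-strength analysis concerns.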