Power-Law Escape Rate of SGD

Authors: Takashi Mori, Liu Ziyin, Kangqiao Liu, Masahito Ueda

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In Section 5.1, we experimentally verify the decoupling approximation for the entire training dynamics. In Section 5.2, we measure the SGD noise strength and confirm that it is indeed proportional to the loss function near a minimum. In Section 5.3, we experimentally test the validity of Eq. (15) for the escape rate."
Researcher Affiliation | Academia | "1 Center for Emergent Matter Science, RIKEN, Saitama, Japan; 2 Department of Physics, The University of Tokyo, Tokyo, Japan; 3 Institute for Physics of Intelligence, The University of Tokyo, Tokyo, Japan."
Pseudocode | No | The paper contains no pseudocode or algorithm blocks; procedures are described through mathematical derivations and textual explanations.
Open Source Code | No | The paper provides no statement or link regarding the availability of open-source code for the described methodology.
Open Datasets | Yes | "We consider a binary classification problem using the first 10^4 samples of the MNIST dataset... First, we consider training of the Fashion-MNIST dataset... Second, we consider training of the CIFAR-10 dataset..."
Dataset Splits | No | The paper mentions training data but does not explicitly provide the training/validation/test splits needed to reproduce the experiments.
Hardware Specification | No | The paper does not describe the hardware used to run its experiments, such as specific GPU models, CPU types, or cloud resources.
Software Dependencies | No | The paper does not provide version numbers for any software components, such as programming languages, libraries, or frameworks.
Experiment Setup | Yes | "We fix η = 0.01 and B = 100." "Starting from the Glorot initialization, the network is trained by SGD of the mini-batch size B = 100 and η = 0.1 for the mean-square loss." "We fix B = 100 in both cases, and η = 0.1 for the fully connected network and η = 0.05 for the convolutional network."
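The quoted setup (Glorot initialization, mini-batch SGD with B = 100 and η = 0.01 on a mean-square loss) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the two-layer tanh network, synthetic data, and all dimensions are assumptions chosen only to show the training loop structure.

```python
# Hedged sketch of the quoted setup: Glorot-initialized network trained by
# mini-batch SGD (B = 100, eta = 0.01) on a mean-square loss.
# Architecture and data are illustrative assumptions, not the paper's.
import numpy as np

rng = np.random.default_rng(0)

n_samples, d_in, d_hidden = 1000, 20, 32
X = rng.standard_normal((n_samples, d_in))
y = (X[:, 0] > 0).astype(float)[:, None]  # toy binary targets in {0, 1}

def glorot(fan_in, fan_out):
    # Glorot (Xavier) uniform: U(-a, a) with a = sqrt(6 / (fan_in + fan_out))
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_in, fan_out))

W1, W2 = glorot(d_in, d_hidden), glorot(d_hidden, 1)
eta, B = 0.01, 100  # hyperparameters quoted in the paper

def forward(X, W1, W2):
    h = np.tanh(X @ W1)
    return h, h @ W2

_, out0 = forward(X, W1, W2)
loss0 = np.mean((out0 - y) ** 2)  # mean-square loss before training

for step in range(200):
    idx = rng.choice(n_samples, size=B, replace=False)  # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    h, out = forward(Xb, W1, W2)
    err = 2.0 * (out - yb) / B                     # dL/d(out) for MSE
    gW2 = h.T @ err                                # gradient w.r.t. W2
    gW1 = Xb.T @ ((err @ W2.T) * (1.0 - h ** 2))   # backprop through tanh
    W1 -= eta * gW1                                # SGD update
    W2 -= eta * gW2

_, out1 = forward(X, W1, W2)
loss1 = np.mean((out1 - y) ** 2)  # loss after 200 SGD steps
```

Per-step gradients fluctuate around the full-batch gradient, which is exactly the SGD noise whose strength the paper measures near a minimum.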