Power-Law Escape Rate of SGD
Authors: Takashi Mori, Liu Ziyin, Kangqiao Liu, Masahito Ueda
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Section 5.1, we experimentally verify the decoupling approximation for the entire training dynamics. In Section 5.2, we measure the SGD noise strength and confirm that it is indeed proportional to the loss function near a minimum. In Section 5.3, we experimentally test the validity of Eq. (15) for the escape rate. (A hedged sketch of such a noise-strength measurement appears below the table.) |
| Researcher Affiliation | Academia | 1Center for Emergent Matter Science, Riken, Saitama, Japan 2Department of Physics, The University of Tokyo, Tokyo, Japan 3Institute for Physics of Intelligence, The University of Tokyo, Tokyo, Japan. |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. Procedures are described through mathematical derivations and textual explanations. |
| Open Source Code | No | The paper does not provide any statement or link regarding the availability of open-source code for the described methodology. |
| Open Datasets | Yes | We consider a binary classification problem using the first 10^4 samples of the MNIST dataset... First, we consider training of the Fashion-MNIST dataset... Second, we consider training of the CIFAR-10 dataset... |
| Dataset Splits | No | The paper mentions training data but does not explicitly specify the training/validation/test splits needed to reproduce the experiments. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments, lacking details such as specific GPU models, CPU types, or cloud resources with specs. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software components, such as programming languages, libraries, or frameworks used in the experiments. |
| Experiment Setup | Yes | We fix η = 0.01 and B = 100. ... Starting from the Glorot initialization, the network is trained by SGD of the mini-batch size B = 100 and η = 0.1 for the mean-square loss. ... We fix B = 100 in both cases, and η = 0.1 for the fully connected network and η = 0.05 for the convolutional network. (These excerpts describe different experiments; a hedged training sketch follows the table.) |
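
The paper quotes the hyperparameters above (η = 0.01, B = 100, Glorot initialization, mean-square loss, first 10^4 MNIST samples) but names no framework or architecture. The sketch below is a minimal reconstruction under stated assumptions: PyTorch is assumed, the hidden width (100) and depth are guesses rather than the paper's values, and the binary relabeling rule for MNIST is not given in the excerpt, so even/odd digit parity is used as a stand-in.

```python
import torch
import torch.nn as nn
from torchvision import datasets, transforms

# Hyperparameters quoted in the paper; everything else here is assumed.
ETA, B, N_SAMPLES = 0.01, 100, 10_000

# First 10^4 MNIST samples; the paper's binary relabeling rule is not
# specified, so even/odd digit parity is used as a placeholder target.
mnist = datasets.MNIST("data", train=True, download=True,
                       transform=transforms.ToTensor())
xs = torch.stack([mnist[i][0].view(-1) for i in range(N_SAMPLES)])
ys = torch.tensor([mnist[i][1] % 2 for i in range(N_SAMPLES)],
                  dtype=torch.float32).unsqueeze(1)

# Small fully connected net; width and depth are assumptions.
model = nn.Sequential(nn.Linear(784, 100), nn.ReLU(), nn.Linear(100, 1))
for m in model:
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)  # Glorot initialization
        nn.init.zeros_(m.bias)

opt = torch.optim.SGD(model.parameters(), lr=ETA)
loss_fn = nn.MSELoss()  # mean-square loss, as quoted

for epoch in range(10):
    perm = torch.randperm(N_SAMPLES)
    for i in range(0, N_SAMPLES, B):
        idx = perm[i:i + B]
        opt.zero_grad()
        loss = loss_fn(model(xs[idx]), ys[idx])
        loss.backward()
        opt.step()
```
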
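Section 5.2 of the paper measures the SGD noise strength and finds it proportional to the loss near a minimum. The paper's exact estimator is not reproduced in the excerpts, so the sketch below uses an assumed proxy: the mean squared deviation of mini-batch gradients from the full-batch gradient, E[||g_B − g||²]. The function name `sgd_noise_strength` and the `n_batches` parameter are hypothetical choices for this illustration.

```python
import torch

def sgd_noise_strength(model, loss_fn, xs, ys, batch_size, n_batches=50):
    """Estimate E[||g_B - g||^2] over random mini-batches.

    An assumed proxy for the SGD noise strength; the paper's precise
    definition may differ.
    """
    params = [p for p in model.parameters() if p.requires_grad]

    def grad_on(x, y):
        model.zero_grad()
        loss_fn(model(x), y).backward()
        return torch.cat([p.grad.flatten().clone() for p in params])

    g_full = grad_on(xs, ys)                 # full-batch gradient
    sq_norms = []
    for _ in range(n_batches):
        idx = torch.randperm(len(xs))[:batch_size]
        g_batch = grad_on(xs[idx], ys[idx])  # mini-batch gradient
        sq_norms.append((g_batch - g_full).pow(2).sum().item())
    return sum(sq_norms) / n_batches
```

To mirror the paper's check that the noise strength is proportional to the loss near a minimum, one could record `sgd_noise_strength(...)` alongside the full-batch loss at several points late in training and verify that their ratio stays roughly constant.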