SGD Can Converge to Local Maxima

Authors: Liu Ziyin, Botao Li, James B. Simon, Masahito Ueda

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We also realize results in a minimal neural network-like example. In Sec. 6, we present the numerical simulations, including a minimal example involving a neural network.
Researcher Affiliation | Academia | (1) The University of Tokyo; (2) ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris; (3) University of California, Berkeley
Pseudocode | No | The paper defines its algorithms (e.g., SGD and AMSGrad) through mathematical update equations rather than pseudocode (a minimal sketch of these update rules follows the table).
Open Source Code | No | The paper does not provide any statement about releasing source code or links to a code repository.
Open Datasets | No | The paper uses only synthetic toy examples and a minimal neural network-like example; it does not use publicly available datasets.
Dataset Splits | No | The paper does not specify explicit training/validation/test dataset splits. For the toy neural network example, it mentions only how the weights are initialized, not any data split.
Hardware Specification | No | The paper does not provide any specific details about the hardware used for running experiments (e.g., specific GPU/CPU models, memory, or cloud computing instances).
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the experiments.
Experiment Setup | Yes | In this numerical example, we set λ = 0.8 and a = −1... we set λ = 0.2 and β2 = 0.999 for both Adam and AMSGrad. When momentum is used, we set β1 = 0.9. GD is run with a learning rate of 0.01. ...w1 is initialized uniformly in [−1,1]; w2 is initialized uniformly in [0,1]... at a small learning rate (λ = 0.001)... when the learning rate is large (λ = 0.1). (A hedged sketch of this configuration follows the table.)
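
Since the paper states SGD and AMSGrad only as update equations, the following is a minimal NumPy sketch of the standard SGD-with-momentum and AMSGrad update rules, using the step sizes quoted above (0.01 for GD, λ = 0.2 with β1 = 0.9 and β2 = 0.999 for AMSGrad). The gradient argument and the quadratic loss in the usage example are placeholders, not the paper's objective.

```python
# Minimal sketch of the standard SGD-with-momentum and AMSGrad updates,
# written from the textbook definitions; this is not the paper's code.
import numpy as np

def sgd_momentum_step(w, m, grad, lr=0.01, beta1=0.9):
    """One SGD-with-momentum step: m <- beta1 * m + g, then w <- w - lr * m."""
    m = beta1 * m + grad
    w = w - lr * m
    return w, m

def amsgrad_step(w, m, v, v_hat, grad, lr=0.2, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad step: Adam-style moment estimates, but with a running max of v."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    v_hat = np.maximum(v_hat, v)           # the AMSGrad modification to Adam
    w = w - lr * m / (np.sqrt(v_hat) + eps)
    return w, m, v, v_hat

# Placeholder usage on a quadratic loss ||w||^2 (not the paper's objective).
w = np.array([0.5, -0.3])
m = v = v_hat = np.zeros_like(w)
for _ in range(100):
    grad = 2.0 * w
    w, m, v, v_hat = amsgrad_step(w, m, v, v_hat, grad)
print("w after 100 AMSGrad steps:", w)
```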
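
For concreteness, here is a hedged reconstruction of the quoted experiment configuration (learning rates, β coefficients, the toy-example constant a, and the uniform initialization of w1 and w2). The loss function and training loop below are stand-in placeholders for illustration only; the paper's actual toy objective and protocol are not reproduced here.

```python
# Hypothetical reconstruction of the reported experiment configuration.
# The loss is a placeholder surrogate, NOT the paper's toy objective.
import numpy as np

rng = np.random.default_rng(0)

# Initialization as quoted: w1 ~ Uniform[-1, 1], w2 ~ Uniform[0, 1].
w1 = rng.uniform(-1.0, 1.0)
w2 = rng.uniform(0.0, 1.0)

# Hyperparameters as quoted in the Experiment Setup row.
config = {
    "lambda_example": 0.8,   # λ quoted for the first numerical example
    "a": -1.0,               # toy-example constant quoted alongside λ = 0.8
    "adam_lr": 0.2,          # λ used for Adam / AMSGrad
    "beta1": 0.9,            # momentum coefficient when momentum is used
    "beta2": 0.999,          # second-moment coefficient for Adam / AMSGrad
    "gd_lr": 0.01,           # GD learning rate
    "small_lr": 0.001,       # "small learning rate" regime
    "large_lr": 0.1,         # "large learning rate" regime
}

def placeholder_loss_grad(w1, w2, x):
    """Per-sample gradient of a placeholder loss 0.5 * (w2 * w1 * x)^2."""
    y = w2 * w1 * x
    return y * w2 * x, y * w1 * x   # dL/dw1, dL/dw2

# Illustrative (stochastic) gradient descent with the quoted GD learning rate.
for step in range(1000):
    x = rng.normal()                     # synthetic input sample
    g1, g2 = placeholder_loss_grad(w1, w2, x)
    w1 -= config["gd_lr"] * g1
    w2 -= config["gd_lr"] * g2

print(f"final parameters: w1={w1:.4f}, w2={w2:.4f}")
```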