AdaGrad Stepsizes: Sharp Convergence Over Nonconvex Landscapes

Authors: Rachel Ward, Xiaoxia Wu, Léon Bottou

ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments in Section 4 show that the robustness of AdaGrad-Norm extends from simple linear regression to state-of-the-art models in deep learning, without sacrificing generalization.
Researcher Affiliation | Collaboration | Rachel Ward*¹², Xiaoxia Wu*¹², Léon Bottou² (*equal contribution; ¹Department of Mathematics, The University of Texas at Austin, USA; ²Facebook AI Research, New York, USA).
Pseudocode | Yes | Algorithm 1 (AdaGrad-Norm); a minimal sketch of the update rule appears after this table.
Open Source Code | Yes | Details on implementing AdaGrad-Norm in a neural network are explained in the appendix, and the code is also provided: https://github.com/xwuShirley/pytorch/blob/master/torch/optim/adagradnorm.py (an illustrative re-implementation appears after this table).
Open Datasets | Yes | Datasets and Models: We test on three datasets: MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky, 2009), and ImageNet (Deng et al., 2009); a loading sketch appears after this table.
Dataset Splits | No | The paper mentions "mini-batch" sizes for training and reports training/testing accuracy, but does not explicitly state the dataset split percentages or sample counts for train/validation/test sets. It also refers to a "standard setup" for ResNet without elaborating on splits.
Hardware Specification | No | The paper mentions that 2 GPUs and 8 GPUs were used for experiments but does not specify the exact GPU models, CPU models, or other hardware specifications.
Software Dependencies | No | The paper states that experiments are "done in PyTorch (Paszke et al., 2017)" but does not specify the PyTorch version number or any other software dependencies.
Experiment Setup | Yes | We set η = 1 in AdaGrad-Norm implementations, noting that in all these problems we know that F* = 0, and measure that F(x0) is between 1 and 10. [...] For both data sets, we use simple SGD without momentum and a mini-batch of 128 images per iteration. [...] For ImageNet, we use ResNet-50 with no momentum and 256 images per iteration. [...] In addition, we set the initialization of the weights in the last fully connected layer to be i.i.d. Gaussian with zero mean and variance 1/2048 (see the initialization sketch after this table).
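
The Pseudocode row refers to the paper's Algorithm 1. As a reading aid, here is a minimal NumPy sketch of that update rule, assuming a 1-D parameter vector; the function name adagrad_norm, the gradient oracle grad_f, and the default b0 are our own placeholders.

```python
import numpy as np

def adagrad_norm(grad_f, x0, b0=0.1, eta=1.0, num_iters=1000):
    """Sketch of AdaGrad-Norm (Algorithm 1): a single adaptive stepsize
    driven by the accumulated squared norms of the (stochastic) gradients."""
    x = np.asarray(x0, dtype=float)
    b_sq = b0 ** 2  # accumulator b_j^2, initialized with b_0 > 0
    for _ in range(num_iters):
        g = grad_f(x)                    # gradient (or stochastic gradient) at x_j
        b_sq += float(np.dot(g, g))      # b_{j+1}^2 = b_j^2 + ||g_j||^2
        x = x - eta * g / np.sqrt(b_sq)  # x_{j+1} = x_j - eta * g_j / b_{j+1}
    return x
```

Note that the accumulator is updated before the step, so the stepsize η/b_{j+1} already reflects the current gradient; the paper's point is that this scheme is robust to the choice of b0.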
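The Open Source Code row points at the authors' adagradnorm.py. As an approximation for readers without the repository at hand, here is a sketch of AdaGrad-Norm as a torch.optim.Optimizer subclass; the class name, the b0 default, and the single scalar accumulator stored on the optimizer are our assumptions, not a copy of the authors' file.

```python
import torch
from torch.optim import Optimizer

class AdaGradNorm(Optimizer):
    """Sketch of AdaGrad-Norm with one scalar accumulator for all parameters."""

    def __init__(self, params, lr=1.0, b0=0.1):
        super().__init__(params, dict(lr=lr))
        self.b_sq = b0 ** 2  # accumulator b_j^2, initialized with b_0 > 0

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        # b_{j+1}^2 = b_j^2 + ||g_j||^2, where g_j stacks all parameter gradients.
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is not None:
                    self.b_sq += p.grad.pow(2).sum().item()
        denom = self.b_sq ** 0.5
        # x_{j+1} = x_j - lr * g_j / b_{j+1}: one global stepsize for every parameter.
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is not None:
                    p.add_(p.grad, alpha=-group['lr'] / denom)
        return loss
```

With the paper's η = 1, one would instantiate it as optimizer = AdaGradNorm(model.parameters(), lr=1.0).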
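The Open Datasets row names MNIST, CIFAR-10, and ImageNet. A minimal torchvision loading sketch for the two smaller datasets follows; the root path, transform, and download flag are our choices, and ImageNet is omitted because it requires a manual download.

```python
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()

# Training splits of the two smaller datasets named in the paper.
mnist_train = datasets.MNIST(root='data', train=True, download=True, transform=to_tensor)
cifar_train = datasets.CIFAR10(root='data', train=True, download=True, transform=to_tensor)
```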
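Finally, the last detail quoted in the Experiment Setup row (i.i.d. Gaussian last-layer weights with zero mean and variance 1/2048) maps directly onto PyTorch initializers. The sketch assumes ResNet-50's 2048-to-1000 final linear layer; zeroing the bias is our assumption, since the quote only specifies the weights.

```python
import torch.nn as nn

# ResNet-50's final fully connected layer: 2048 features -> 1000 ImageNet classes.
fc = nn.Linear(2048, 1000)

# i.i.d. Gaussian weights, zero mean, variance 1/2048 (std = 1/sqrt(2048)).
nn.init.normal_(fc.weight, mean=0.0, std=(1.0 / 2048) ** 0.5)
nn.init.zeros_(fc.bias)  # assumption: the paper does not state the bias init
```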