AdaGrad Stepsizes: Sharp Convergence Over Nonconvex Landscapes
Authors: Rachel Ward, Xiaoxia Wu, Leon Bottou
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments in Section 4 show that the robustness of AdaGrad-Norm extends from simple linear regression to state-of-the-art models in deep learning, without sacrificing generalization. |
| Researcher Affiliation | Collaboration | Rachel Ward*¹², Xiaoxia Wu*¹², Léon Bottou² (*equal contribution; ¹Department of Mathematics, The University of Texas at Austin, USA; ²Facebook AI Research, New York, USA). |
| Pseudocode | Yes | Algorithm 1 AdaGrad-Norm |
| Open Source Code | Yes | Details on implementing AdaGrad-Norm in a neural network are explained in the appendix and the code is also provided. ³AdaGrad-Norm https://github.com/xwuShirley/pytorch/blob/master/torch/optim/adagradnorm.py |
| Open Datasets | Yes | Datasets and Models: We test on three datasets: MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky, 2009) and ImageNet (Deng et al., 2009) |
| Dataset Splits | No | The paper mentions using "mini-batch" sizes for training and reports training/testing accuracy, but does not explicitly state the dataset split percentages or sample counts for train/validation/test sets. It also refers to "standard setup" for ResNet without elaboration on splits. |
| Hardware Specification | No | The paper mentions "2 GPUs" and "8 GPUs" used for experiments but does not specify the exact GPU models, CPU models, or other hardware specifications. |
| Software Dependencies | No | The paper states experiments are "done in PyTorch (Paszke et al., 2017)" but does not specify the version number of PyTorch or any other software dependencies. |
| Experiment Setup | Yes | We set η = 1 in AdaGrad-Norm implementations, noting that in all these problems we know that F* = 0 and measure that F(x0) is between 1 and 10. [...] For both datasets, we use simple SGD without momentum and set a mini-batch of 128 images per iteration [...] For ImageNet, we use ResNet-50 with no momentum and 256 images per iteration. [...] In addition, we set the initialization of weights in the last fully connected layer to be i.i.d. Gaussian with zero mean and variance 1/2048. |
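The pseudocode row above refers to Algorithm 1 (AdaGrad-Norm), which maintains a single scalar stepsize whose square accumulates the squared gradient norms. As a rough illustration, here is a minimal NumPy sketch of that update applied to a toy least-squares problem; the function names, the default `b0`, and the toy problem itself are illustrative assumptions, not the paper's reference code (that lives at the linked `adagradnorm.py`).

```python
import numpy as np

def adagrad_norm(grad, x0, eta=1.0, b0=1e-2, iters=2000):
    """Sketch of the AdaGrad-Norm update (after Algorithm 1):
    b_{j+1}^2 = b_j^2 + ||grad F(x_j)||^2,
    x_{j+1}  = x_j - (eta / b_{j+1}) * grad F(x_j).
    b0 here is an illustrative default; the paper studies robustness
    to this initialization."""
    x, b2 = x0.astype(float), b0 ** 2
    for _ in range(iters):
        g = grad(x)
        b2 += np.dot(g, g)           # accumulate squared gradient norm
        x -= eta / np.sqrt(b2) * g   # normalized gradient step
    return x

# Toy least-squares problem with a known minimizer x_star,
# so that F* = 0 as in the paper's experimental setting.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
x_star = rng.standard_normal(5)
y = A @ x_star
grad = lambda x: A.T @ (A @ x - y) / len(y)

x_hat = adagrad_norm(grad, x0=np.zeros(5), eta=1.0)
print(np.linalg.norm(x_hat - x_star))
```

Note the contrast with coordinate-wise AdaGrad: a single scalar `b2` scales the whole gradient, which is what makes the stepsize analysis in the paper tractable for nonconvex objectives.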