AdaGrad Stepsizes: Sharp Convergence Over Nonconvex Landscapes

Authors: Rachel Ward, Xiaoxia Wu, Léon Bottou

ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments in Section 4 show that the robustness of AdaGrad-Norm extends from simple linear regression to state-of-the-art models in deep learning, without sacrificing generalization.
Researcher Affiliation | Collaboration | Rachel Ward*¹², Xiaoxia Wu*¹², Léon Bottou² (*equal contribution; ¹Department of Mathematics, The University of Texas at Austin, USA; ²Facebook AI Research, New York, USA).
Pseudocode | Yes | Algorithm 1 (AdaGrad-Norm); a minimal sketch of the update rule appears after this table.
Open Source Code | Yes | Details on implementing AdaGrad-Norm in a neural network are explained in the appendix, and the code is also provided: https://github.com/xwuShirley/pytorch/blob/master/torch/optim/adagradnorm.py (an illustrative re-implementation appears after this table).
Open Datasets | Yes | Datasets and Models: We test on three datasets: MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky, 2009), and ImageNet (Deng et al., 2009); a loading sketch appears after this table.
Dataset Splits | No | The paper mentions "mini-batch" sizes for training and reports training/testing accuracy, but does not explicitly state the dataset split percentages or sample counts for train/validation/test sets. It also refers to a "standard setup" for ResNet without elaborating on splits.
Hardware Specification | No | The paper mentions that 2 GPUs and 8 GPUs were used for experiments but does not specify the exact GPU models, CPU models, or other hardware specifications.
Software Dependencies | No | The paper states that experiments are "done in PyTorch (Paszke et al., 2017)" but does not specify the PyTorch version number or any other software dependencies.
Experiment Setup | Yes | We set η = 1 in AdaGrad-Norm implementations, noting that in all these problems we know that F* = 0, and measure that F(x0) is between 1 and 10. [...] For both data sets, we use simple SGD without momentum and a mini-batch of 128 images per iteration. [...] For ImageNet, we use ResNet-50 with no momentum and 256 images per iteration. [...] In addition, we set the initialization of the weights in the last fully connected layer to be i.i.d. Gaussian with zero mean and variance 1/2048 (see the initialization sketch after this table).
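
The Pseudocode row refers to the paper's Algorithm 1. As a reading aid, here is a minimal NumPy sketch of that update rule, assuming a 1-D parameter vector; the function name adagrad_norm, the gradient oracle grad_f, and the default b0 are our own placeholders.

```python
import numpy as np

def adagrad_norm(grad_f, x0, b0=0.1, eta=1.0, num_iters=1000):
    """Sketch of AdaGrad-Norm (Algorithm 1): a single adaptive stepsize
    driven by the accumulated squared norms of the (stochastic) gradients."""
    x = np.asarray(x0, dtype=float)
    b_sq = b0 ** 2  # accumulator b_j^2, initialized with b_0 > 0
    for _ in range(num_iters):
        g = grad_f(x)                    # gradient (or stochastic gradient) at x_j
        b_sq += float(np.dot(g, g))      # b_{j+1}^2 = b_j^2 + ||g_j||^2
        x = x - eta * g / np.sqrt(b_sq)  # x_{j+1} = x_j - eta * g_j / b_{j+1}
    return x
```

Note that the accumulator is updated before the step, so the stepsize η/b_{j+1} already reflects the current gradient; the paper's point is that this scheme is robust to the choice of b0.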
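The Open Source Code row points at the authors' adagradnorm.py. As an approximation for readers without the repository at hand, here is a sketch of AdaGrad-Norm as a torch.optim.Optimizer subclass; the class name, the b0 default, and the single scalar accumulator stored on the optimizer are our assumptions, not a copy of the authors' file.

```python
import torch
from torch.optim import Optimizer

class AdaGradNorm(Optimizer):
    """Sketch of AdaGrad-Norm with one scalar accumulator for all parameters."""

    def __init__(self, params, lr=1.0, b0=0.1):
        super().__init__(params, dict(lr=lr))
        self.b_sq = b0 ** 2  # accumulator b_j^2, initialized with b_0 > 0

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        # b_{j+1}^2 = b_j^2 + ||g_j||^2, where g_j stacks all parameter gradients.
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is not None:
                    self.b_sq += p.grad.pow(2).sum().item()
        denom = self.b_sq ** 0.5
        # x_{j+1} = x_j - lr * g_j / b_{j+1}: one global stepsize for every parameter.
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is not None:
                    p.add_(p.grad, alpha=-group['lr'] / denom)
        return loss
```

With the paper's η = 1, one would instantiate it as optimizer = AdaGradNorm(model.parameters(), lr=1.0).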
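The Open Datasets row names MNIST, CIFAR-10, and ImageNet. A minimal torchvision loading sketch for the two smaller datasets follows; the root path, transform, and download flag are our choices, and ImageNet is omitted because it requires a manual download.

```python
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()

# Training splits of the two smaller datasets named in the paper.
mnist_train = datasets.MNIST(root='data', train=True, download=True, transform=to_tensor)
cifar_train = datasets.CIFAR10(root='data', train=True, download=True, transform=to_tensor)
```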
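Finally, the last detail quoted in the Experiment Setup row (i.i.d. Gaussian last-layer weights with zero mean and variance 1/2048) maps directly onto PyTorch initializers. The sketch assumes ResNet-50's 2048-to-1000 final linear layer; zeroing the bias is our assumption, since the quote only specifies the weights.

```python
import torch.nn as nn

# ResNet-50's final fully connected layer: 2048 features -> 1000 ImageNet classes.
fc = nn.Linear(2048, 1000)

# i.i.d. Gaussian weights, zero mean, variance 1/2048 (std = 1/sqrt(2048)).
nn.init.normal_(fc.weight, mean=0.0, std=(1.0 / 2048) ** 0.5)
nn.init.zeros_(fc.bias)  # assumption: the paper does not state the bias init
```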