Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
AdaGrad stepsizes: Sharp convergence over nonconvex landscapes
Authors: Rachel Ward, Xiaoxia Wu, Leon Bottou
JMLR 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive numerical experiments are provided to corroborate our theoretical findings; moreover, the experiments suggest that the robustness of AdaGrad-Norm extends to the models in deep learning. |
| Researcher Affiliation | Collaboration | Rachel Ward EMAIL Xiaoxia Wu EMAIL Department of Mathematics The University of Texas at Austin 2515 Speedway, Austin, TX, 78712, USA Léon Bottou EMAIL Facebook AI Research 770 Broadway, New York, NY, 10019, USA |
| Pseudocode | Yes | Algorithm 1 AdaGrad-Norm... Algorithm 2 Gradient Descent with Line Search Method... Algorithm 3 AdaGrad-Norm with momentum in PyTorch |
| Open Source Code | Yes | Details in implementing AdaGrad-Norm in a neural network are explained in the appendix and the code is also provided. [Footnote 4: https://github.com/xwuShirley/pytorch/blob/master/torch/optim/adagradnorm.py] |
| Open Datasets | Yes | Datasets and Models: We test on three data sets: MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky, 2009) and ImageNet (Deng et al., 2009); see Table 1 in the appendix for detailed descriptions. |
| Dataset Splits | Yes | Table 1 (statistics of data sets; Dim is the dimension of a sample): MNIST — 60,000 train / 10,000 test, 10 classes, 28×28; CIFAR-10 — 50,000 train / 10,000 test, 10 classes, 32×32; ImageNet — 1,281,167 train / 50,000 test, 1000 classes, various dimensions. |
| Hardware Specification | No | For both data sets, we use 256 images per iteration (2 GPUs with 128 images/GPU, 234 iterations per epoch for MNIST and 196 iterations per epoch for CIFAR-10). For ImageNet, we use ResNet-50 and 256 images per iteration (8 GPUs with 32 images/GPU, 5004 iterations per epoch). The paper mentions using GPUs but does not specify the GPU model or type, which is required for a specific hardware description. |
| Software Dependencies | No | The experiments are done in PyTorch (Paszke et al., 2017). This citation refers to PyTorch but does not provide a specific version number for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | We set η = 1 in all AdaGrad implementations, noting that in all these problems we know that F* = 0 and we measure that F(x0) is between 1 and 10. Indeed, we approximate the loss using a sample of 256 images as (1/256) Σ_{i=1}^{256} f_i(x0): 2.4129 for logistic regression, 2.305 for the two-layer fully connected model, 2.301 for the convolutional neural network, 2.3848 for ResNet-18 with the learnable parameters in BatchNorm disabled, 2.3459 for ResNet-18 with default BatchNorm, and 7.704 for ResNet-50. We vary the initialization b0 while fixing all other parameters and plot the training and testing accuracy after different numbers of epochs. For the MNIST experiment, we do not use bias, regularization (zero weight decay), dropout, momentum, batch normalization (Ioffe and Szegedy, 2015), or any other added features that help achieve SOTA performance (see Figure 3 and Figure 4). |
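For readers unfamiliar with the algorithm the rows above reference, the core AdaGrad-Norm update (Algorithm 1 in the paper) maintains a single scalar stepsize denominator b_t that accumulates squared gradient norms. The following is a minimal illustrative sketch in plain Python, not the authors' PyTorch implementation linked above; the function name `adagrad_norm` and its parameters are placeholders chosen here for illustration.

```python
import math

def adagrad_norm(grad_fn, x0, b0=0.1, eta=1.0, steps=100):
    """Illustrative sketch of the AdaGrad-Norm update with a scalar stepsize.

    Per-step update (norm version of AdaGrad):
        b_{t+1}^2 = b_t^2 + ||g_t||^2
        x_{t+1}   = x_t - (eta / b_{t+1}) * g_t
    grad_fn maps an iterate (list of floats) to its gradient (list of floats).
    """
    x = list(x0)
    b_sq = b0 * b0
    for _ in range(steps):
        g = grad_fn(x)
        b_sq += sum(gi * gi for gi in g)          # accumulate squared gradient norm
        step = eta / math.sqrt(b_sq)              # self-tuned scalar stepsize
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

# Toy usage: minimize f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x_final = adagrad_norm(lambda x: x, [2.0, -1.0], b0=0.1, eta=1.0, steps=100)
```

On this toy quadratic the accumulated b_t stabilizes once gradients shrink, so the iterates contract toward the minimizer regardless of how small b0 was initialized, which mirrors the robustness-to-b0 experiments described in the Experiment Setup row.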