Efficient Full-Matrix Adaptive Regularization
Authors: Naman Agarwal, Brian Bullins, Xinyi Chen, Elad Hazan, Karan Singh, Cyril Zhang, Yi Zhang
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our preliminary experiments show improved iteration-wise convergence rates across synthetic tasks and standard deep learning benchmarks, and that the more carefully preconditioned steps sometimes lead to a better solution. In this section, we present an empirical study of GGT. We begin with some simple experiments, showing that adaptive methods help in the presence of ill-conditioned optimization problems, as well as the value of limited gradient memory. Next, we evaluate the performance of GGT on larger-scale deep learning tasks (and provide some additional such experiments in Appendix B). |
| Researcher Affiliation | Collaboration | 1Google AI Princeton 2Department of Computer Science, Princeton University. Correspondence to: Cyril Zhang <cyril.zhang@cs.princeton.edu>. |
| Pseudocode | Yes | The mathematical specification of GGT is given in Algorithm 1, in the usual model of stochastic optimization (see Section 4), with gradients ∇̃f(x). (A hedged sketch of such a windowed preconditioner appears below the table.) |
| Open Source Code | No | The paper does not provide any explicit statements about the availability of open-source code for the described methodology, nor does it include any links to a code repository. |
| Open Datasets | Yes | We investigated the training dynamics of GGT on a typical deep architecture for computer vision. For this, we used a 26-layer 3-branch residual network with Shake-Shake regularization (Gastaldi, 2017). Next, we move to recurrent architectures for language modeling. We train a 3-layer LSTM (Hochreiter & Schmidhuber, 1997) with 5M parameters for character-level modeling of the Penn Treebank dataset (Marcus et al., 1994). |
| Dataset Splits | Yes | Our results are shown in Figure 3 (top). In terms of training loss, GGT consistently dominated existing optimizers. We used a batch size of 128, and the standard data augmentation techniques of 4-pixel padding + random cropping and horizontal flipping. Figure 3 (bottom) shows training and validation perplexities for the first 50 epochs; no optimizer makes significant progress afterwards. |
| Hardware Specification | No | The paper mentions 'GPU' and discusses running time overheads, but it does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for the experiments. |
| Software Dependencies | No | The paper mentions various optimizers and frameworks indirectly through citations (e.g., TensorFlow, PyTorch), but it does not specify any software dependencies with version numbers that would be required to reproduce the experiments. |
| Experiment Setup | Yes | For both Adam and GGT, we chose the commonly used parameters β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸; for SGD, we used momentum with parameter 0.9. We used a batch size of 128, and the standard data augmentation techniques of 4-pixel padding + random cropping and horizontal flipping. In each experiment, we kept the cosine learning rate annealing schedule used in the paper, originally from (Loshchilov & Hutter, 2016); performance degraded consistently and significantly with a fixed learning rate. (Illustrative sketches of this data pipeline and optimizer configuration appear below the table.) |
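
The Pseudocode row refers to Algorithm 1 of the paper but the excerpt does not reproduce it. Below is a minimal NumPy sketch of the kind of windowed full-matrix preconditioning GGT describes: a buffer of the r most recent gradients forms a d × r matrix G, and the step direction [(GGᵀ)^½ + εI]⁻¹g is obtained through the small r × r Gram matrix GᵀG rather than any d × d object. The function name `ggt_step`, the window handling, and the numerical details (rank cutoff, placement of ε) are our assumptions, not the authors' implementation.

```python
import numpy as np

def ggt_step(window, eps=1e-8):
    """Preconditioned direction from a window of recent gradients.

    window: list of the r most recent gradient vectors (most recent last),
            each of shape (d,). The current gradient is window[-1], so it
            lies in the column span of G and the projection below is exact.
    Returns an approximation of [(G G^T)^{1/2} + eps*I]^{-1} g, computed via
    the small r x r Gram matrix instead of any d x d object.
    """
    G = np.stack(window, axis=1)      # d x r matrix of recent gradients
    g = window[-1]                    # current gradient
    M = G.T @ G                       # r x r Gram matrix, cheap since r << d
    sig2, V = np.linalg.eigh(M)       # eigenvalues = squared singular values of G
    sig = np.sqrt(np.clip(sig2, 0.0, None))
    keep = sig > 1e-12                # drop numerically null directions
    U = G @ (V[:, keep] / sig[keep])  # orthonormal basis for span(G)
    # (G G^T)^{1/2} = U diag(sig) U^T, so the eps-regularized inverse scales
    # the component of g along each left singular vector u_i by 1/(sig_i + eps).
    return U @ ((U.T @ g) / (sig[keep] + eps))
```

A caller would refresh the window with each new stochastic gradient (the paper also down-weights older gradients, which we read the quoted β₂ as governing) and move by −η times the returned direction.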
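
For the vision experiment, the quoted setup (batch size 128, 4-pixel padding + random crop, horizontal flip) corresponds to a standard torchvision pipeline such as the sketch below. CIFAR-10 is an assumption on our part: the excerpt names only the Shake-Shake architecture, which is conventionally trained on that dataset.

```python
import torch
from torchvision import datasets, transforms

# 4-pixel padding + random crop and horizontal flip, as quoted above.
# CIFAR-10 is assumed; the excerpt does not name the dataset.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

train_set = datasets.CIFAR10(root="./data", train=True, download=True,
                             transform=train_transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128,
                                           shuffle=True, num_workers=2)
```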
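
The quoted hyperparameters (β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸ for Adam, momentum 0.9 for SGD, cosine annealing) map onto stock PyTorch optimizers as sketched below. The learning rate and epoch count are placeholders, since the excerpt does not state them, and the stand-in model replaces the Shake-Shake ResNet and LSTM used in the paper.

```python
import torch

# Stand-in model; the paper trains a Shake-Shake ResNet and a 3-layer LSTM.
model = torch.nn.Linear(10, 2)

# Placeholder values: the excerpt does not state the learning rate or epoch count.
lr, epochs = 0.1, 200

sgd = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
adam = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.999), eps=1e-8)

# Cosine learning-rate annealing (Loshchilov & Hutter, 2016), stepped once per epoch.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(sgd, T_max=epochs)

for epoch in range(epochs):
    # ... one training pass calling sgd.step() (or adam.step()) per batch ...
    scheduler.step()
```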