Efficient Full-Matrix Adaptive Regularization
Authors: Naman Agarwal, Brian Bullins, Xinyi Chen, Elad Hazan, Karan Singh, Cyril Zhang, Yi Zhang
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our preliminary experiments show improved iteration-wise convergence rates across synthetic tasks and standard deep learning benchmarks, and that the more carefully preconditioned steps sometimes lead to a better solution. In this section, we present an empirical study of GGT. We begin with some simple experiments, showing that adaptive methods help in the presence of ill-conditioned optimization problems, as well as the value of limited gradient memory. Next, we evaluate the performance of GGT on larger-scale deep learning tasks (and provide some additional such experiments in Appendix B). |
| Researcher Affiliation | Collaboration | 1Google AI Princeton 2Department of Computer Science, Princeton University. Correspondence to: Cyril Zhang <cyril.zhang@cs.princeton.edu>. |
| Pseudocode | Yes | The mathematical specification of GGT is given in Algorithm 1, in the usual model of stochastic optimization (see Section 4), with gradients ∇̃f(x). (A hedged sketch of such a windowed preconditioner appears below the table.) |
| Open Source Code | No | The paper does not provide any explicit statements about the availability of open-source code for the described methodology, nor does it include any links to a code repository. |
| Open Datasets | Yes | We investigated the training dynamics of GGT on a typical deep architecture for computer vision. For this, we used a 26-layer 3-branch residual network with Shake-Shake regularization (Gastaldi, 2017). Next, we move to recurrent architectures for language modeling. We train a 3-layer LSTM (Hochreiter & Schmidhuber, 1997) with 5M parameters for character-level modeling of the Penn Treebank dataset (Marcus et al., 1994). |
| Dataset Splits | Yes | Our results are shown in Figure 3 (top). In terms of training loss, GGT consistently dominated existing optimizers. We used a batch size of 128, and the standard data augmentation techniques of 4-pixel padding + random cropping and horizontal flipping. Figure 3 (bottom) shows training and validation perplexities for the first 50 epochs; no optimizer makes significant progress afterwards. |
| Hardware Specification | No | The paper mentions 'GPU' and discusses running time overheads, but it does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for the experiments. |
| Software Dependencies | No | The paper mentions various optimizers and frameworks indirectly through citations (e.g., TensorFlow, PyTorch), but it does not specify any software dependencies with version numbers that would be required to reproduce the experiments. |
| Experiment Setup | Yes | For both Adam and GGT, we chose the commonly used parameters β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸; for SGD, we used momentum with parameter 0.9. We used a batch size of 128, and the standard data augmentation techniques of 4-pixel padding + random cropping and horizontal flipping. In each experiment, we kept the cosine learning rate annealing schedule used in the paper, originally from (Loshchilov & Hutter, 2016); performance degraded consistently and significantly with a fixed learning rate. (Illustrative sketches of this data pipeline and optimizer configuration appear below the table.) |
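
The Pseudocode row refers to Algorithm 1 of the paper but the excerpt does not reproduce it. Below is a minimal NumPy sketch of the kind of windowed full-matrix preconditioning GGT describes: a buffer of the r most recent gradients forms a d × r matrix G, and the step direction [(GGᵀ)^½ + εI]⁻¹g is obtained through the small r × r Gram matrix GᵀG rather than any d × d object. The function name `ggt_step`, the window handling, and the numerical details (rank cutoff, placement of ε) are our assumptions, not the authors' implementation.

```python
import numpy as np

def ggt_step(window, eps=1e-8):
    """Preconditioned direction from a window of recent gradients.

    window: list of the r most recent gradient vectors (most recent last),
            each of shape (d,). The current gradient is window[-1], so it
            lies in the column span of G and the projection below is exact.
    Returns an approximation of [(G G^T)^{1/2} + eps*I]^{-1} g, computed via
    the small r x r Gram matrix instead of any d x d object.
    """
    G = np.stack(window, axis=1)      # d x r matrix of recent gradients
    g = window[-1]                    # current gradient
    M = G.T @ G                       # r x r Gram matrix, cheap since r << d
    sig2, V = np.linalg.eigh(M)       # eigenvalues = squared singular values of G
    sig = np.sqrt(np.clip(sig2, 0.0, None))
    keep = sig > 1e-12                # drop numerically null directions
    U = G @ (V[:, keep] / sig[keep])  # orthonormal basis for span(G)
    # (G G^T)^{1/2} = U diag(sig) U^T, so the eps-regularized inverse scales
    # the component of g along each left singular vector u_i by 1/(sig_i + eps).
    return U @ ((U.T @ g) / (sig[keep] + eps))
```

A caller would refresh the window with each new stochastic gradient (the paper also down-weights older gradients, which we read the quoted β₂ as governing) and move by −η times the returned direction.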
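
For the vision experiment, the quoted setup (batch size 128, 4-pixel padding + random crop, horizontal flip) corresponds to a standard torchvision pipeline such as the sketch below. CIFAR-10 is an assumption on our part: the excerpt names only the Shake-Shake architecture, which is conventionally trained on that dataset.

```python
import torch
from torchvision import datasets, transforms

# 4-pixel padding + random crop and horizontal flip, as quoted above.
# CIFAR-10 is assumed; the excerpt does not name the dataset.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

train_set = datasets.CIFAR10(root="./data", train=True, download=True,
                             transform=train_transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128,
                                           shuffle=True, num_workers=2)
```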
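
The quoted hyperparameters (β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸ for Adam, momentum 0.9 for SGD, cosine annealing) map onto stock PyTorch optimizers as sketched below. The learning rate and epoch count are placeholders, since the excerpt does not state them, and the stand-in model replaces the Shake-Shake ResNet and LSTM used in the paper.

```python
import torch

# Stand-in model; the paper trains a Shake-Shake ResNet and a 3-layer LSTM.
model = torch.nn.Linear(10, 2)

# Placeholder values: the excerpt does not state the learning rate or epoch count.
lr, epochs = 0.1, 200

sgd = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
adam = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.999), eps=1e-8)

# Cosine learning-rate annealing (Loshchilov & Hutter, 2016), stepped once per epoch.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(sgd, T_max=epochs)

for epoch in range(epochs):
    # ... one training pass calling sgd.step() (or adam.step()) per batch ...
    scheduler.step()
```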