Sketchy: Memory-efficient Adaptive Regularization with Frequent Directions

Authors: Vladimir Feinberg, Xinyi Chen, Y. Jennifer Sun, Rohan Anil, Elad Hazan

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the effectiveness of S-Shampoo as a practical second-order algorithm for training networks, including ResNet-50 [3] for the image classification task of ImageNet [33] with random cropping and flipping augmentations; a 16-layer Conformer model [34] for the audio transcription task, LibriSpeech [35]; and a GNN with 5 message-passing steps [36] on ogbg-molpcba [37], which classifies structural properties of graphically encoded molecule inputs. ... As Fig. 2 demonstrates, the second-order information leveraged by Shampoo results in improvements over Adam, a first-order method. Our method performs at least as well as Adam in all cases, despite using asymptotically less memory to represent covariance (as Fig. 1 shows, Adam uses O(mn) memory for a rectangular weight matrix's diagonal accumulators, whereas S-Shampoo uses O(mk + nk)). (A worked memory comparison appears after this table.)
Researcher Affiliation | Collaboration | Vladimir Feinberg (Google DeepMind, vladf@google.com); Xinyi Chen (Princeton University and Google DeepMind); Y. Jennifer Sun (Princeton University and Google DeepMind); Rohan Anil (Google DeepMind); Elad Hazan (Princeton University and Google DeepMind)
Pseudocode | Yes | Algorithm 1 Frequent Directions Update (FD-update)... Algorithm 2 Sketchy AdaGrad (S-AdaGrad)... Algorithm 3 Sketchy Shampoo (S-Shampoo). (A minimal code sketch of the FD step appears after this table.)
Open Source Code | No | The paper references several open-source projects (init2winit, JAX, Flax, and MLCommons Algorithmic Efficiency), but it provides no link to, or explicit statement about open-sourcing, an implementation of the methods described in this paper (Sketchy, S-AdaGrad, S-Shampoo).
Open Datasets | Yes | ResNet-50 [3] for the image classification task of ImageNet [33] with random cropping and flipping augmentations; a 16-layer Conformer model [34] for the audio transcription task, LibriSpeech [35]; a GNN with 5 message-passing steps [36] on ogbg-molpcba [37], which classifies structural properties of graphically encoded molecule inputs.
Dataset Splits | Yes | We tune only common parameters between the three optimizers with the same budgets, selecting based on validation set accuracy.
Hardware Specification | Yes | From TPUv2 to TPUv3, per-chip bfloat16 operations per second improved 2.67× but memory bandwidth only improved 1.29×. GPUs exhibit a similar pattern for compute and memory increase, at 5× and 2.2×, for V100 to A100 [12].
Software Dependencies | No | The paper mentions software such as JAX and Flax, but does not provide specific version numbers for them or for any other key software components used in the experiments.
Experiment Setup | Yes | Our FD variant of Shampoo introduces only one new hyperparameter, the rank ℓ, which we do not tune but set to ℓ = 256, which translates to 4× memory savings for Shampoo blocks of size 1024 for the accumulators. For all our architectures, we tune Shampoo and extract the intermediate gradient covariances over the course of training. To make our curves comparable across architectures, we fix the parameter for the second moment, β2 = 0.999, for these runs. (The 4× figure is worked through after this table.)
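
The Research Type row above quotes an asymptotic memory claim: O(mn) for Adam's diagonal accumulators of an m × n weight matrix versus O(mk + nk) for S-Shampoo's sketched covariance factors. A minimal back-of-the-envelope sketch, assuming illustrative sizes m = n = 4096 and k = 256 (not values taken from the paper) and counting accumulator entries only:

    def adam_accumulator_entries(m, n):
        # Adam stores one diagonal second-moment entry per parameter of an
        # m x n weight matrix: O(mn) entries.
        return m * n

    def s_shampoo_sketch_entries(m, n, k):
        # S-Shampoo stores rank-k sketches of the row and column covariance
        # factors: O(mk + nk) entries.
        return m * k + n * k

    m, n, k = 4096, 4096, 256                  # illustrative, not from the paper
    print(adam_accumulator_entries(m, n))      # 16777216
    print(s_shampoo_sketch_entries(m, n, k))   # 2097152, 8x fewer at these sizes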
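
The Pseudocode row lists Algorithm 1, the Frequent Directions update. The snippet below is a minimal NumPy sketch of the classical Frequent Directions step that Sketchy builds on; it is not a transcription of the paper's Algorithm 1, and the function name fd_update, the dimensions, and the row-wise sketch layout are assumptions for illustration:

    import numpy as np

    def fd_update(B, g):
        # B is an ell x d sketch whose Gram matrix B.T @ B approximates the
        # running gradient covariance; g is a new d-dimensional gradient row.
        ell, d = B.shape
        augmented = np.vstack([B, g[None, :]])            # (ell + 1) x d
        _, s, vt = np.linalg.svd(augmented, full_matrices=False)
        # Shrink all squared singular values by the smallest one and drop the
        # extra direction, keeping the sketch at ell rows with bounded error.
        delta = s[-1] ** 2
        shrunk = np.sqrt(np.maximum(s[:ell] ** 2 - delta, 0.0))
        return shrunk[:, None] * vt[:ell]

    # Usage: maintain a rank-256 sketch of 1024-dimensional gradients.
    rng = np.random.default_rng(0)
    B = np.zeros((256, 1024))
    for _ in range(10):
        B = fd_update(B, rng.normal(size=1024))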
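
The 4× savings quoted in the Experiment Setup row follows from simple per-accumulator accounting; the block size and rank come from that row, while treating each Kronecker factor separately is an assumption of this sketch:

    block, ell = 1024, 256
    full_factor_entries = block * block       # one full Shampoo accumulator: 1048576
    sketched_factor_entries = block * ell     # one rank-ell sketched factor: 262144
    print(full_factor_entries // sketched_factor_entries)  # 4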