Shampoo: Preconditioned Stochastic Tensor Optimization

Authors: Vineet Gupta, Tomer Koren, Yoram Singer

ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments with state-of-the-art deep learning models show that Shampoo is capable of converging considerably faster than commonly used optimizers. We performed experiments with Shampoo on several datasets, using standard deep neural-network models.
Researcher Affiliation | Collaboration | Google Brain, Mountain View, CA, USA; Princeton University, Princeton, NJ, USA.
Pseudocode | Yes | Algorithm 1 (Shampoo, matrix case): Initialize W_1 = 0_{m×n}; L_0 = εI_m; R_0 = εI_n. For t = 1, ..., T: receive loss function f_t : R^{m×n} → R; compute gradient G_t = ∇f_t(W_t), with G_t ∈ R^{m×n}; update preconditioners L_t = L_{t-1} + G_t G_t^T and R_t = R_{t-1} + G_t^T G_t; update parameters W_{t+1} = W_t - η L_t^{-1/4} G_t R_t^{-1/4}. (Algorithm 2 is also present on page 5.) A hedged NumPy sketch of this update is given below the table.
Open Source Code | No | We implemented Shampoo (in its general tensor form) in Python as a new optimizer in the TensorFlow framework (Abadi et al., 2016). We plan to implement Shampoo in PyTorch in the near future.
Open Datasets | Yes | We performed experiments with Shampoo on several datasets, using standard deep neural-network models. We focused on two domains: image classification on CIFAR-10/100, and statistical language modeling on LM1B.
Dataset Splits | No | The paper mentions training and testing, but it does not provide specific details about the training/validation/test splits, such as exact percentages, sample counts, or citations to predefined splits.
Hardware Specification | Yes | Table 1 shows the average number of steps (i.e., batches of size 128) per second on a Tesla K40 GPU.
Software Dependencies | No | We implemented Shampoo in its general tensor form in Python as a new TensorFlow (Abadi et al., 2016) optimizer. The paper mentions Python and TensorFlow but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | In all of our experiments, we worked with a mini-batch of size 128. First, we employed a delayed update for the preconditioners, and recomputed the roots of the matrices H_t^i once every 20-100 steps. Second, we incorporated momentum into the gradient step, essentially computing the running average of the gradients Ḡ_t = α Ḡ_{t-1} + (1 - α) G_t with a fixed setting of α = 0.9. For each optimization algorithm, we explored 10 different learning rates between 0.01 and 10.0 (scaling the entire range for Adam by a factor of 10^-4), and chose the one with the best loss. For Shampoo, we simply used the default learning rate of η = 1.0. A sketch combining the delayed-root and momentum tweaks follows the algorithm sketch below.
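
To make the quoted Algorithm 1 concrete, here is a minimal NumPy sketch of the matrix-case update. It is not the authors' TensorFlow implementation; the helper names (inv_fourth_root, shampoo_step) and the use of a symmetric eigendecomposition to form the inverse fourth roots are illustrative assumptions.

import numpy as np

def inv_fourth_root(M):
    # Symmetric eigendecomposition of the PSD statistic M, then raise the
    # eigenvalues to the power -1/4 (assumes M was regularized with eps*I).
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs @ np.diag(eigvals ** -0.25) @ eigvecs.T

def shampoo_step(W, G, L, R, lr=1.0):
    # One matrix-case Shampoo update for an m x n parameter W with gradient G.
    L = L + G @ G.T                      # left preconditioner statistic  (m x m)
    R = R + G.T @ G                      # right preconditioner statistic (n x n)
    W = W - lr * inv_fourth_root(L) @ G @ inv_fourth_root(R)
    return W, L, R

# Usage: W_1 = 0, L_0 = eps*I_m, R_0 = eps*I_n, then one call per minibatch gradient.
m, n, eps = 4, 3, 1e-4
W = np.zeros((m, n))
L, R = eps * np.eye(m), eps * np.eye(n)
G = np.random.randn(m, n)                # stand-in for the gradient of f_t at W_t
W, L, R = shampoo_step(W, G, L, R)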
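
The Experiment Setup row mentions two practical tweaks: recomputing the inverse fourth roots only once every 20-100 steps, and averaging gradients with α = 0.9. The sketch below, under the same NumPy assumptions as above, shows one way these could be combined; the names shampoo_with_tricks, grad_fn, and root_update_every are hypothetical and not taken from the paper.

import numpy as np

def shampoo_with_tricks(grad_fn, W, num_steps, lr=1.0, eps=1e-4,
                        alpha=0.9, root_update_every=50):
    # grad_fn(W) is assumed to return the minibatch gradient of the loss at W.
    m, n = W.shape
    L, R = eps * np.eye(m), eps * np.eye(n)
    L_root, R_root = np.eye(m), np.eye(n)        # cached L^(-1/4) and R^(-1/4)
    G_avg = np.zeros_like(W)
    for t in range(num_steps):
        G = grad_fn(W)
        G_avg = alpha * G_avg + (1 - alpha) * G  # running average of gradients
        L += G @ G.T
        R += G.T @ G
        if t % root_update_every == 0:           # delayed recomputation of the roots
            eL, VL = np.linalg.eigh(L)
            eR, VR = np.linalg.eigh(R)
            L_root = VL @ np.diag(eL ** -0.25) @ VL.T
            R_root = VR @ np.diag(eR ** -0.25) @ VR.T
        W = W - lr * L_root @ G_avg @ R_root
    return W

Caching the roots between recomputations amortizes the cost of the eigendecompositions over many steps, which is the point of the delayed preconditioner update described in the paper.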