Shampoo: Preconditioned Stochastic Tensor Optimization

Authors: Vineet Gupta, Tomer Koren, Yoram Singer

ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments with state-of-the-art deep learning models show that Shampoo is capable of converging considerably faster than commonly used optimizers. We performed experiments with Shampoo on several datasets, using standard deep neural-network models.
Researcher Affiliation | Collaboration | Google Brain, Mountain View, CA, USA; Princeton University, Princeton, NJ, USA.
Pseudocode | Yes | Algorithm 1 (Shampoo, matrix case): Initialize W_1 = 0_{m×n}; L_0 = εI_m; R_0 = εI_n. For t = 1, ..., T: receive loss function f_t : R^{m×n} → R; compute gradient G_t = ∇f_t(W_t), with G_t ∈ R^{m×n}; update preconditioners L_t = L_{t-1} + G_t G_t^T and R_t = R_{t-1} + G_t^T G_t; update parameters W_{t+1} = W_t - η L_t^{-1/4} G_t R_t^{-1/4}. (Algorithm 2 is also present on page 5.) A hedged NumPy sketch of this update is given below the table.
Open Source Code | No | We implemented Shampoo (in its general tensor form) in Python as a new optimizer in the TensorFlow framework (Abadi et al., 2016). We plan to implement Shampoo in PyTorch in the near future.
Open Datasets | Yes | We performed experiments with Shampoo on several datasets, using standard deep neural-network models. We focused on two domains: image classification on CIFAR-10/100, and statistical language modeling on LM1B.
Dataset Splits | No | The paper mentions training and testing, but it does not provide specific details about the training/validation/test splits, such as exact percentages, sample counts, or citations to predefined splits.
Hardware Specification | Yes | Table 1 shows the average number of steps (i.e., batches of size 128) per second on a Tesla K40 GPU.
Software Dependencies | No | We implemented Shampoo in its general tensor form in Python as a new TensorFlow (Abadi et al., 2016) optimizer. The paper mentions Python and TensorFlow but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | In all of our experiments, we worked with a mini-batch of size 128. First, we employed a delayed update for the preconditioners, and recomputed the roots of the matrices H_t^i once every 20-100 steps. Second, we incorporated momentum into the gradient step, essentially computing the running average of the gradients Ḡ_t = α Ḡ_{t-1} + (1 - α) G_t with a fixed setting of α = 0.9. For each optimization algorithm, we explored 10 different learning rates between 0.01 and 10.0 (scaling the entire range for Adam by a factor of 10^-4), and chose the one with the best loss. For Shampoo, we simply used the default learning rate of η = 1.0. A sketch combining the delayed-root and momentum tweaks follows the algorithm sketch below.
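
To make the quoted Algorithm 1 concrete, here is a minimal NumPy sketch of the matrix-case update. It is not the authors' TensorFlow implementation; the helper names (inv_fourth_root, shampoo_step) and the use of a symmetric eigendecomposition to form the inverse fourth roots are illustrative assumptions.

import numpy as np

def inv_fourth_root(M):
    # Symmetric eigendecomposition of the PSD statistic M, then raise the
    # eigenvalues to the power -1/4 (assumes M was regularized with eps*I).
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs @ np.diag(eigvals ** -0.25) @ eigvecs.T

def shampoo_step(W, G, L, R, lr=1.0):
    # One matrix-case Shampoo update for an m x n parameter W with gradient G.
    L = L + G @ G.T                      # left preconditioner statistic  (m x m)
    R = R + G.T @ G                      # right preconditioner statistic (n x n)
    W = W - lr * inv_fourth_root(L) @ G @ inv_fourth_root(R)
    return W, L, R

# Usage: W_1 = 0, L_0 = eps*I_m, R_0 = eps*I_n, then one call per minibatch gradient.
m, n, eps = 4, 3, 1e-4
W = np.zeros((m, n))
L, R = eps * np.eye(m), eps * np.eye(n)
G = np.random.randn(m, n)                # stand-in for the gradient of f_t at W_t
W, L, R = shampoo_step(W, G, L, R)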
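
The Experiment Setup row mentions two practical tweaks: recomputing the inverse fourth roots only once every 20-100 steps, and averaging gradients with α = 0.9. The sketch below, under the same NumPy assumptions as above, shows one way these could be combined; the names shampoo_with_tricks, grad_fn, and root_update_every are hypothetical and not taken from the paper.

import numpy as np

def shampoo_with_tricks(grad_fn, W, num_steps, lr=1.0, eps=1e-4,
                        alpha=0.9, root_update_every=50):
    # grad_fn(W) is assumed to return the minibatch gradient of the loss at W.
    m, n = W.shape
    L, R = eps * np.eye(m), eps * np.eye(n)
    L_root, R_root = np.eye(m), np.eye(n)        # cached L^(-1/4) and R^(-1/4)
    G_avg = np.zeros_like(W)
    for t in range(num_steps):
        G = grad_fn(W)
        G_avg = alpha * G_avg + (1 - alpha) * G  # running average of gradients
        L += G @ G.T
        R += G.T @ G
        if t % root_update_every == 0:           # delayed recomputation of the roots
            eL, VL = np.linalg.eigh(L)
            eR, VR = np.linalg.eigh(R)
            L_root = VL @ np.diag(eL ** -0.25) @ VL.T
            R_root = VR @ np.diag(eR ** -0.25) @ VR.T
        W = W - lr * L_root @ G_avg @ R_root
    return W

Caching the roots between recomputations amortizes the cost of the eigendecompositions over many steps, which is the point of the delayed preconditioner update described in the paper.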