Shampoo: Preconditioned Stochastic Tensor Optimization
Authors: Vineet Gupta, Tomer Koren, Yoram Singer
ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments with state-of-the-art deep learning models show that Shampoo is capable of converging considerably faster than commonly used optimizers. We performed experiments with Shampoo on several datasets, using standard deep neural-network models. |
| Researcher Affiliation | Collaboration | Google Brain, Mountain View, CA, USA; Princeton University, Princeton, NJ, USA. |
| Pseudocode | Yes | Algorithm 1 Shampoo, matrix case. Initialize W_1 = 0_{m×n}; L_0 = εI_m; R_0 = εI_n. For t = 1, ..., T: receive loss function f_t : R^{m×n} → R; compute gradient G_t = ∇f_t(W_t), with G_t ∈ R^{m×n}; update preconditioners L_t = L_{t−1} + G_t G_t^T and R_t = R_{t−1} + G_t^T G_t; update parameters W_{t+1} = W_t − η L_t^{−1/4} G_t R_t^{−1/4}. (Algorithm 2 is also present on page 5. A NumPy sketch of the matrix-case update appears after this table.) |
| Open Source Code | No | We implemented Shampoo (in its general tensor form) in Python as a new optimizer in the TensorFlow framework (Abadi et al., 2016). We plan to implement Shampoo in PyTorch in the near future. |
| Open Datasets | Yes | We performed experiments with Shampoo on several datasets, using standard deep neural-network models. We focused on two domains: image classification on CIFAR10/100, and statistical language modeling on LM1B. |
| Dataset Splits | No | The paper mentions training and testing, but it does not provide specific details about the training/validation/test splits, such as exact percentages, sample counts, or citations to predefined splits. |
| Hardware Specification | Yes | Table 1 shows the average number of steps (i.e., batches of size 128) per second on a Tesla K40 GPU. |
| Software Dependencies | No | We implemented Shampoo in its general tensor form in Python as a new TensorFlow (Abadi et al., 2016) optimizer. The paper mentions Python and TensorFlow but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | In all of our experiments, we worked with a mini-batch of size 128. First, we employed a delayed update for the preconditioners, and recomputed the roots of the matrices H^i_t only once every 20-100 steps. Second, we incorporated momentum into the gradient step, essentially computing the running average of the gradients Ḡ_t = α·Ḡ_{t−1} + (1 − α)·G_t with a fixed setting of α = 0.9. For each optimization algorithm, we explored 10 different learning rates between 0.01 and 10.0 (scaling the entire range for Adam by a factor of 10^-4), and chose the one with the best loss. For Shampoo, we simply used the default learning rate of η = 1.0. (A sketch of these two heuristics appears after the table, following the Algorithm 1 sketch.) |
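As a reading aid for the pseudocode row above, here is a minimal NumPy sketch of the matrix-case update (Algorithm 1). It is not the authors' TensorFlow implementation; the function names (`shampoo_matrix_case`, `inv_fourth_root`, `grad_fn`) and the eigendecomposition route to the inverse fourth roots are illustrative assumptions.

```python
import numpy as np

def inv_fourth_root(M):
    """Inverse 4th root of a symmetric positive-definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(vals ** -0.25) @ vecs.T

def shampoo_matrix_case(grad_fn, m, n, T, lr=1.0, eps=1e-4):
    """Sketch of Algorithm 1 (matrix case). `grad_fn(t, W)` is assumed to return G_t."""
    W = np.zeros((m, n))   # W_1 = 0_{m x n}
    L = eps * np.eye(m)    # L_0 = eps * I_m
    R = eps * np.eye(n)    # R_0 = eps * I_n
    for t in range(T):
        G = grad_fn(t, W)              # G_t in R^{m x n}
        L = L + G @ G.T                # L_t = L_{t-1} + G_t G_t^T
        R = R + G.T @ G                # R_t = R_{t-1} + G_t^T G_t
        # W_{t+1} = W_t - eta * L_t^{-1/4} G_t R_t^{-1/4}
        W = W - lr * inv_fourth_root(L) @ G @ inv_fourth_root(R)
    return W
```

Recomputing the inverse fourth roots at every step, as this literal rendering does, is exactly the cost that the delayed-update heuristic in the experiment-setup row avoids; see the next sketch.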
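The experiment-setup row mentions two practical modifications: recomputing the preconditioner roots only once every 20-100 steps, and a running average of the gradients with α = 0.9. The sketch below shows one way they could be combined with the update above; the `state` dictionary, `init_state`, `root_every`, and the choice to accumulate the raw gradient (rather than the averaged one) into L and R are assumptions not spelled out in the quoted text.

```python
import numpy as np

def init_state(m, n, eps=1e-4):
    """Hypothetical optimizer state: statistics, cached roots, averaged gradient, step counter."""
    return {"L": eps * np.eye(m), "R": eps * np.eye(n),
            "L_root": np.eye(m), "R_root": np.eye(n),
            "G_bar": np.zeros((m, n)), "step": 0}

def shampoo_step_with_heuristics(W, G, state, lr=1.0, alpha=0.9, root_every=50):
    """One Shampoo step with delayed root recomputation and gradient momentum (illustrative)."""
    # Momentum: running average of gradients, G_bar_t = alpha * G_bar_{t-1} + (1 - alpha) * G_t.
    state["G_bar"] = alpha * state["G_bar"] + (1 - alpha) * G

    # Preconditioner statistics are accumulated every step (raw gradient assumed here).
    state["L"] += G @ G.T
    state["R"] += G.T @ G

    # Delayed update: recompute the expensive inverse fourth roots only once
    # every `root_every` steps (the paper reports once every 20-100 steps).
    if state["step"] % root_every == 0:
        for key, M in (("L_root", state["L"]), ("R_root", state["R"])):
            vals, vecs = np.linalg.eigh(M)
            state[key] = vecs @ np.diag(vals ** -0.25) @ vecs.T
    state["step"] += 1

    # Parameter update uses the cached roots and the averaged gradient.
    return W - lr * state["L_root"] @ state["G_bar"] @ state["R_root"]
```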