Compressing Gradient Optimizers via Count-Sketches
Authors: Ryan Spring, Anastasios Kyrillidis, Vijai Mohan, Anshumali Shrivastava
ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show a rigorous evaluation on popular architectures such as ResNet-18 and Transformer-XL. On the 1-Billion Word dataset, we save 25% of the memory used during training (7.7 GB instead of 10.8 GB) with minimal accuracy and performance loss. For an Amazon extreme classification task with over 49.5 million classes, we also reduce the training time by 38% by increasing the mini-batch size 3.5× using our count-sketch optimizer. |
| Researcher Affiliation | Collaboration | Ryan Spring (1)*, Anastasios Kyrillidis (1), Vijai Mohan (2), Anshumali Shrivastava (1,2) ... (1) Department of Computer Science, Rice University, Houston, TX, USA; (2) Amazon Search, Palo Alto, CA, USA. Correspondence to: Ryan Spring <rdspring1@rice.edu>. |
| Pseudocode | Yes | Algorithm 1 Count-Sketch Tensor; Algorithm 2 Momentum Count-Sketch Optimizer; Algorithm 3 Adagrad Count-Sketch Optimizer; Algorithm 4 Adam Count-Sketch Optimizer (an illustrative sketch of the count-sketch data structure follows this table) |
| Open Source Code | Yes | The code for the Count-Sketch Optimizer is available online: https://github.com/rdspring1/Count-Sketch-Optimizers |
| Open Datasets | Yes | Wikitext-2 (Merity et al., 2016); Wikitext-103 (Merity et al., 2016); 1-Billion Word (LM1B) (Chelba et al., 2013). An open-sourced PyTorch model is available online: https://github.com/rdspring1/PyTorch_GBW_LM; MegaFace (Nech & Kemelmacher-Shlizerman, 2017); ImageNet (Russakovsky et al., 2015) |
| Dataset Splits | No | The paper mentions 'validation error' for Wikitext-2 and that '10K images are randomly sampled to create the test dataset' for MegaFace, but it does not consistently report percentages or absolute counts for the training/validation/test splits across all datasets, nor does it describe a general splitting methodology. |
| Hardware Specification | Yes | All of the experiments were performed with the PyTorch framework on a single machine with 2x Intel Xeon E5-2660 v4 processors (28 cores / 56 threads), 512 GB of memory, and a single Nvidia Tesla V100. |
| Software Dependencies | No | The paper mentions the 'PyTorch framework' but does not specify its version number or the versions of any other key software components, libraries, or solvers required for replication. |
| Experiment Setup | Yes | For Momentum, the learning rate was 2.5, the decay rate γ was 0.9, and we clipped the gradient norm to 0.25. For Adam, the learning rate was 0.001, the beta values β1, β2 were 0.9 and 0.999, and gradient clipping was 1. (Wikitext-2); We trained a ResNet-18 architecture on the ImageNet dataset for 90 epochs with a batch size of 256. The baseline optimizer was RMSprop with a learning rate of 0.01. (ImageNet) A hedged configuration sketch of these settings follows this table. |
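
The Pseudocode row above lists Algorithm 1, the Count-Sketch Tensor, as the core data structure behind the compressed optimizer states. As a point of reference, the following is a minimal, illustrative Python sketch of a signed count-sketch over row-indexed auxiliary state (e.g., momentum for a large embedding table). It is an assumption-based reconstruction, not the authors' implementation, and the class and parameter names (`CountSketchTensor`, `depth`, `width`) are hypothetical.

```python
# Illustrative count-sketch for row-indexed optimizer state (not the authors' code).
import numpy as np


class CountSketchTensor:
    def __init__(self, depth, width, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.depth, self.width = depth, width
        # `depth` independent hash tables, each with `width` buckets of size `dim`.
        self.table = np.zeros((depth, width, dim))
        self._p = 2**31 - 1  # Mersenne prime for universal hashing
        self._a = rng.integers(1, self._p, size=depth)  # bucket hash slopes
        self._b = rng.integers(0, self._p, size=depth)  # bucket hash offsets
        self._c = rng.integers(1, self._p, size=depth)  # sign hash slopes
        self._d = rng.integers(0, self._p, size=depth)  # sign hash offsets

    def _bucket(self, row_ids):
        # Map each row id to one bucket per hash table.
        return ((np.outer(self._a, row_ids) + self._b[:, None]) % self._p) % self.width

    def _sign(self, row_ids):
        # Random +/-1 signs so hash collisions cancel in expectation.
        h = (np.outer(self._c, row_ids) + self._d[:, None]) % self._p
        return np.where(h % 2 == 0, 1.0, -1.0)

    def update(self, row_ids, values):
        # Accumulate signed gradient rows into every hash table.
        buckets, signs = self._bucket(row_ids), self._sign(row_ids)
        for d in range(self.depth):
            np.add.at(self.table[d], buckets[d], signs[d][:, None] * values)

    def query(self, row_ids):
        # Median over the `depth` estimates is robust to colliding heavy rows.
        buckets, signs = self._bucket(row_ids), self._sign(row_ids)
        estimates = np.stack([signs[d][:, None] * self.table[d][buckets[d]]
                              for d in range(self.depth)])
        return np.median(estimates, axis=0)


# Toy usage: sketch the momentum of a (huge) embedding table with few buckets.
sketch = CountSketchTensor(depth=3, width=1024, dim=4)
ids = np.array([7, 123456, 987654321])
grads = np.ones((3, 4))
sketch.update(ids, grads)
print(np.round(sketch.query(ids), 2))  # approximately recovers `grads`
```

Because `width` can be far smaller than the number of embedding rows, the table uses a fraction of the memory of a dense auxiliary buffer, and taking the median across `depth` independent hash tables keeps each estimate robust to collisions.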
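
The Experiment Setup row quotes concrete hyperparameters. The snippet below is a hedged configuration sketch that maps those settings onto stock `torch.optim` optimizers as stand-ins for the paper's count-sketch variants; the one-layer placeholder model and variable names are illustrative only.

```python
# Hedged reproduction of the reported hyperparameters with stock PyTorch optimizers.
import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # hypothetical placeholder model, not the paper's network

# Wikitext-2, Momentum: learning rate 2.5, decay rate gamma 0.9, grad-norm clip 0.25
momentum_opt = torch.optim.SGD(model.parameters(), lr=2.5, momentum=0.9)
# Wikitext-2, Adam: learning rate 0.001, betas (0.9, 0.999), gradient clipping 1.0
adam_opt = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
# ImageNet / ResNet-18 baseline: RMSprop with learning rate 0.01 (90 epochs, batch 256)
rmsprop_opt = torch.optim.RMSprop(model.parameters(), lr=0.01)

# One illustrative step with the Momentum settings.
loss = model(torch.randn(4, 10)).sum()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.25)
momentum_opt.step()
momentum_opt.zero_grad()
```

Gradient-norm clipping (0.25 for Momentum, 1.0 for Adam) is applied outside the optimizer via `clip_grad_norm_`, which is the usual way clipping is configured in PyTorch training loops.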