Compressing Gradient Optimizers via Count-Sketches
Authors: Ryan Spring, Anastasios Kyrillidis, Vijai Mohan, Anshumali Shrivastava
ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show a rigorous evaluation on popular architectures such as ResNet-18 and Transformer-XL. On the 1-Billion Word dataset, we save 25% of the memory used during training (7.7 GB instead of 10.8 GB) with minimal accuracy and performance loss. For an Amazon extreme classification task with over 49.5 million classes, we also reduce the training time by 38% by increasing the mini-batch size 3.5× using our count-sketch optimizer. |
| Researcher Affiliation | Collaboration | Ryan Spring (1)*, Anastasios Kyrillidis (1), Vijai Mohan (2), Anshumali Shrivastava (1,2) ... (1) Department of Computer Science, Rice University, Houston, TX, USA; (2) Amazon Search, Palo Alto, CA, USA. Correspondence to: Ryan Spring <rdspring1@rice.edu>. |
| Pseudocode | Yes | Algorithm 1 Count-Sketch Tensor; Algorithm 2 Momentum Count-Sketch Optimizer; Algorithm 3 Adagrad Count-Sketch Optimizer; Algorithm 4 Adam Count-Sketch Optimizer (an illustrative sketch of the count-sketch data structure follows this table) |
| Open Source Code | Yes | The code for the Count-Sketch Optimizer is available online: https://github.com/rdspring1/Count-Sketch-Optimizers |
| Open Datasets | Yes | Wikitext-2 (Merity et al., 2016); Wikitext-103 (Merity et al., 2016); 1-Billion Word (LM1B) (Chelba et al., 2013). An open-sourced PyTorch model is available online: https://github.com/rdspring1/PyTorch_GBW_LM; MegaFace (Nech & Kemelmacher-Shlizerman, 2017); ImageNet (Russakovsky et al., 2015) |
| Dataset Splits | No | The paper mentions 'validation error' for Wikitext-2 and that '10K images are randomly sampled to create the test dataset' for MegaFace, but it does not consistently report percentages or absolute counts for the training/validation/test splits across all datasets, nor does it describe a general splitting methodology. |
| Hardware Specification | Yes | All of the experiments were performed with the PyTorch framework on a single machine with 2x Intel Xeon E5-2660 v4 processors (28 cores / 56 threads), 512 GB of memory, and a single Nvidia Tesla V100. |
| Software Dependencies | No | The paper mentions the 'PyTorch framework' but does not specify its version number or the versions of any other key software components, libraries, or solvers required for replication. |
| Experiment Setup | Yes | For Momentum, the learning rate was 2.5, the decay rate γ was 0.9, and we clipped the gradient norm to 0.25. For Adam, the learning rate was 0.001, the beta values β1, β2 were 0.9 and 0.999, and gradient clipping was 1. (Wikitext-2); We trained a ResNet-18 architecture on the ImageNet dataset for 90 epochs with a batch size of 256. The baseline optimizer was RMSprop with a learning rate of 0.01. (ImageNet) A hedged configuration sketch of these settings follows this table. |
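
The Pseudocode row above lists Algorithm 1, the Count-Sketch Tensor, as the core data structure behind the compressed optimizer states. As a point of reference, the following is a minimal, illustrative Python sketch of a signed count-sketch over row-indexed auxiliary state (e.g., momentum for a large embedding table). It is an assumption-based reconstruction, not the authors' implementation, and the class and parameter names (`CountSketchTensor`, `depth`, `width`) are hypothetical.

```python
# Illustrative count-sketch for row-indexed optimizer state (not the authors' code).
import numpy as np


class CountSketchTensor:
    def __init__(self, depth, width, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.depth, self.width = depth, width
        # `depth` independent hash tables, each with `width` buckets of size `dim`.
        self.table = np.zeros((depth, width, dim))
        self._p = 2**31 - 1  # Mersenne prime for universal hashing
        self._a = rng.integers(1, self._p, size=depth)  # bucket hash slopes
        self._b = rng.integers(0, self._p, size=depth)  # bucket hash offsets
        self._c = rng.integers(1, self._p, size=depth)  # sign hash slopes
        self._d = rng.integers(0, self._p, size=depth)  # sign hash offsets

    def _bucket(self, row_ids):
        # Map each row id to one bucket per hash table.
        return ((np.outer(self._a, row_ids) + self._b[:, None]) % self._p) % self.width

    def _sign(self, row_ids):
        # Random +/-1 signs so hash collisions cancel in expectation.
        h = (np.outer(self._c, row_ids) + self._d[:, None]) % self._p
        return np.where(h % 2 == 0, 1.0, -1.0)

    def update(self, row_ids, values):
        # Accumulate signed gradient rows into every hash table.
        buckets, signs = self._bucket(row_ids), self._sign(row_ids)
        for d in range(self.depth):
            np.add.at(self.table[d], buckets[d], signs[d][:, None] * values)

    def query(self, row_ids):
        # Median over the `depth` estimates is robust to colliding heavy rows.
        buckets, signs = self._bucket(row_ids), self._sign(row_ids)
        estimates = np.stack([signs[d][:, None] * self.table[d][buckets[d]]
                              for d in range(self.depth)])
        return np.median(estimates, axis=0)


# Toy usage: sketch the momentum of a (huge) embedding table with few buckets.
sketch = CountSketchTensor(depth=3, width=1024, dim=4)
ids = np.array([7, 123456, 987654321])
grads = np.ones((3, 4))
sketch.update(ids, grads)
print(np.round(sketch.query(ids), 2))  # approximately recovers `grads`
```

Because `width` can be far smaller than the number of embedding rows, the table uses a fraction of the memory of a dense auxiliary buffer, and taking the median across `depth` independent hash tables keeps each estimate robust to collisions.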
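
The Experiment Setup row quotes concrete hyperparameters. The snippet below is a hedged configuration sketch that maps those settings onto stock `torch.optim` optimizers as stand-ins for the paper's count-sketch variants; the one-layer placeholder model and variable names are illustrative only.

```python
# Hedged reproduction of the reported hyperparameters with stock PyTorch optimizers.
import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # hypothetical placeholder model, not the paper's network

# Wikitext-2, Momentum: learning rate 2.5, decay rate gamma 0.9, grad-norm clip 0.25
momentum_opt = torch.optim.SGD(model.parameters(), lr=2.5, momentum=0.9)
# Wikitext-2, Adam: learning rate 0.001, betas (0.9, 0.999), gradient clipping 1.0
adam_opt = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
# ImageNet / ResNet-18 baseline: RMSprop with learning rate 0.01 (90 epochs, batch 256)
rmsprop_opt = torch.optim.RMSprop(model.parameters(), lr=0.01)

# One illustrative step with the Momentum settings.
loss = model(torch.randn(4, 10)).sum()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.25)
momentum_opt.step()
momentum_opt.zero_grad()
```

Gradient-norm clipping (0.25 for Momentum, 1.0 for Adam) is applied outside the optimizer via `clip_grad_norm_`, which is the usual way clipping is configured in PyTorch training loops.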