Memory Efficient Adaptive Optimization

Authors: Rohan Anil, Vineet Gupta, Tomer Koren, Yoram Singer

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the practical efficacy of SM3 on several machine learning tasks using published state-of-the-art architectures. We focus on three domains: machine translation, language modeling, and image classification. Large scale experiments show that our algorithm achieves comparable, and at times superior, rates of convergence compared to standard linear-space adaptive methods.
Researcher Affiliation | Collaboration | Rohan Anil and Vineet Gupta, Google Brain ({rohananil,vineet}@google.com); Tomer Koren, Google Brain and Tel Aviv University; Yoram Singer, Princeton University (y.s@cs.princeton.edu)
Pseudocode | Yes | See Algorithm SM3-I for its pseudocode. We now discuss a slightly more efficient variant of SM3, which we describe in SM3-II. (A minimal sketch of the SM3-I update, followed by a toy memory-footprint example, appears after this table.)
Open Source Code | Yes | We implemented SM3 as an optimizer in TensorFlow [1]; source code is publicly available at [4]. [4] R. Anil, V. Gupta, T. Koren, and Y. Singer. SM3 TensorFlow optimizer. https://github.com/google-research/google-research/tree/master/sm3, 2019.
Open Datasets | Yes | We experimented with machine translation tasks on two standard datasets from WMT'14: English to French (en→fr) with 36.3M sentence pairs, and English to German (en→de) with 4.5M sentence pairs. Next, we considered a language modeling task on the concatenation of Wikipedia and BooksCorpus [29], and image classification on ImageNet [20].
Dataset Splits | No | The paper mentions a 'holdout set' and 'test accuracy' but does not provide specific details on the training, validation, and test splits (e.g., percentages, sample counts) required for reproduction.
Hardware Specification | Yes | We used the Cloud TPU-v2 device [14], where each core has 8GiB of memory. Both models were trained on a 4x4 Cloud TPU-v2. The experiments were run using the open-sourced code from [10] on an 8x8 Cloud TPU-v2 configuration. We use a Cloud TPU-v3 device, which has 16GiB per core, for this experiment.
Software Dependencies | No | The paper states 'We implemented SM3 as an optimizer in TensorFlow [1]' but does not specify the version number for TensorFlow or any other software dependency.
Experiment Setup | Yes | We trained Transformer-Big on the en→fr dataset with batches of size 384, and compared SM3 with several standard optimizers in each of the tasks. In all cases, we used momentum (including for Adagrad) and extensively tuned all hyperparameters. [This] allowed us to double the number of examples in a batch to a total of 768. The experiments were run using the open-sourced code from [10] on an 8x8 Cloud TPU-v2 configuration. Adam and AdaGrad reached at 500k steps.
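
To make the Pseudocode row concrete, here is a minimal NumPy sketch of an SM3-I-style update for a single matrix-shaped parameter, assuming the cover sets are its rows and columns (the choice the paper describes for matrix-shaped tensors). The function name `sm3_step`, the learning rate, and the small `eps` added for numerical stability are illustrative choices, not part of the paper, and momentum (which the experiments add on top) is omitted; this is a reading of the published pseudocode, not the authors' TensorFlow implementation.

```python
import numpy as np

def sm3_step(w, grad, row_acc, col_acc, lr=0.1, eps=1e-30):
    """One SM3-I-style update for a 2-D parameter with row/column cover sets.

    Only a length-m row accumulator and a length-n column accumulator are
    stored, so optimizer memory is O(m + n) rather than the O(m * n) needed
    for a full per-entry (Adagrad-style) second-moment buffer.
    """
    # Estimate each entry's accumulator as the minimum over the cover sets
    # containing it (its row and its column), plus the new squared gradient.
    nu = np.minimum(row_acc[:, None], col_acc[None, :]) + grad ** 2
    # Adagrad-style step with the estimated accumulator; eps (an added
    # illustrative constant) avoids division by zero for untouched entries.
    w = w - lr * grad / np.sqrt(nu + eps)
    # Fold the per-entry estimates back into the compact state by taking
    # the maximum within each row and each column.
    return w, nu.max(axis=1), nu.max(axis=0)
```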
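The memory claim quoted in the Research Type row (convergence comparable to "standard linear-space adaptive methods") is the point of keeping only per-cover-set statistics. A toy run of the sketch above illustrates the difference in optimizer state size; the matrix shape and gradient values are made up for illustration and do not correspond to any of the paper's models.

```python
# Toy usage of sm3_step on an embedding-sized 32768 x 1024 matrix.
m, n = 32768, 1024
w = np.zeros((m, n))
row_acc, col_acc = np.zeros(m), np.zeros(n)

grad = np.random.randn(m, n) * 0.01        # stand-in for a real gradient
w, row_acc, col_acc = sm3_step(w, grad, row_acc, col_acc, lr=0.1)

# Optimizer state: m + n = 33,792 accumulator entries for SM3's row/column
# statistics, versus m * n = 33,554,432 entries for a full per-parameter
# accumulator as kept by linear-space methods such as Adagrad or Adam.
print(row_acc.size + col_acc.size, m * n)
```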