Memory Efficient Adaptive Optimization

Authors: Rohan Anil, Vineet Gupta, Tomer Koren, Yoram Singer

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the practical efficacy of SM3 on several machine learning tasks using published state-of-the-art architectures. We focus on three domains: machine translation, language modeling, and image classification. Large scale experiments show that our algorithm achieves comparable, and at times superior, rates of convergence compared to standard linear-space adaptive methods.
Researcher Affiliation | Collaboration | Rohan Anil and Vineet Gupta, Google Brain ({rohananil,vineet}@google.com); Tomer Koren, Google Brain and Tel Aviv University; Yoram Singer, Princeton University (y.s@cs.princeton.edu)
Pseudocode | Yes | See Algorithm SM3-I for its pseudocode. We now discuss a slightly more efficient variant of SM3, which we describe in SM3-II. (A minimal sketch of the SM3-I update, followed by a toy memory-footprint example, appears after this table.)
Open Source Code | Yes | We implemented SM3 as an optimizer in TensorFlow [1]; source code is publicly available at [4]. [4] R. Anil, V. Gupta, T. Koren, and Y. Singer. SM3 TensorFlow optimizer. https://github.com/google-research/google-research/tree/master/sm3, 2019.
Open Datasets | Yes | We experimented with machine translation tasks on two standard datasets from WMT'14: English to French (en→fr) with 36.3M sentence pairs, and English to German (en→de) with 4.5M sentence pairs. Next, we considered a language modeling task on the concatenation of Wikipedia and BooksCorpus [29], and image classification on ImageNet [20].
Dataset Splits | No | The paper mentions a 'holdout set' and 'test accuracy' but does not provide specific details on the training, validation, and test splits (e.g., percentages, sample counts) required for reproduction.
Hardware Specification | Yes | We used the Cloud TPU-v2 device [14], where each core has 8GiB of memory. Both models were trained on a 4x4 Cloud TPU-v2. The experiments were run using the open-sourced code from [10] on an 8x8 Cloud TPU-v2 configuration. We use a Cloud TPU-v3 device, which has 16GiB per core, for this experiment.
Software Dependencies | No | The paper states 'We implemented SM3 as an optimizer in TensorFlow [1]' but does not specify the version number for TensorFlow or any other software dependency.
Experiment Setup | Yes | We trained Transformer-Big on the en→fr dataset with batches of size 384, and compared SM3 with several standard optimizers in each of the tasks. In all cases, we used momentum (including for Adagrad) and extensively tuned all hyperparameters. [This] allowed us to double the number of examples in a batch to a total of 768. The experiments were run using the open-sourced code from [10] on an 8x8 Cloud TPU-v2 configuration. Adam and AdaGrad reached at 500k steps.
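
To make the Pseudocode row concrete, here is a minimal NumPy sketch of an SM3-I-style update for a single matrix-shaped parameter, assuming the cover sets are its rows and columns (the choice the paper describes for matrix-shaped tensors). The function name `sm3_step`, the learning rate, and the small `eps` added for numerical stability are illustrative choices, not part of the paper, and momentum (which the experiments add on top) is omitted; this is a reading of the published pseudocode, not the authors' TensorFlow implementation.

```python
import numpy as np

def sm3_step(w, grad, row_acc, col_acc, lr=0.1, eps=1e-30):
    """One SM3-I-style update for a 2-D parameter with row/column cover sets.

    Only a length-m row accumulator and a length-n column accumulator are
    stored, so optimizer memory is O(m + n) rather than the O(m * n) needed
    for a full per-entry (Adagrad-style) second-moment buffer.
    """
    # Estimate each entry's accumulator as the minimum over the cover sets
    # containing it (its row and its column), plus the new squared gradient.
    nu = np.minimum(row_acc[:, None], col_acc[None, :]) + grad ** 2
    # Adagrad-style step with the estimated accumulator; eps (an added
    # illustrative constant) avoids division by zero for untouched entries.
    w = w - lr * grad / np.sqrt(nu + eps)
    # Fold the per-entry estimates back into the compact state by taking
    # the maximum within each row and each column.
    return w, nu.max(axis=1), nu.max(axis=0)
```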
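The memory claim quoted in the Research Type row (convergence comparable to "standard linear-space adaptive methods") is the point of keeping only per-cover-set statistics. A toy run of the sketch above illustrates the difference in optimizer state size; the matrix shape and gradient values are made up for illustration and do not correspond to any of the paper's models.

```python
# Toy usage of sm3_step on an embedding-sized 32768 x 1024 matrix.
m, n = 32768, 1024
w = np.zeros((m, n))
row_acc, col_acc = np.zeros(m), np.zeros(n)

grad = np.random.randn(m, n) * 0.01        # stand-in for a real gradient
w, row_acc, col_acc = sm3_step(w, grad, row_acc, col_acc, lr=0.1)

# Optimizer state: m + n = 33,792 accumulator entries for SM3's row/column
# statistics, versus m * n = 33,554,432 entries for a full per-parameter
# accumulator as kept by linear-space methods such as Adagrad or Adam.
print(row_acc.size + col_acc.size, m * n)
```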