Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

Authors: Noam Shazeer, Mitchell Stern

ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate empirically that this method produces similar results to the baseline. We ran the Transformer model from Vaswani et al. (2017), using Adam with and without our factored second moment estimation for optimization. See Section 9 for more details on the experimental setup. Results were similar in all tested cases. See Table 2 (A) vs. (C) and (H) vs. (J). Results are listed in Table 2.
Researcher Affiliation | Collaboration | Google Brain, Mountain View, California, USA; University of California, Berkeley, California, USA.
Pseudocode | Yes | Algorithm 1: Adam (Kingma & Ba, 2015); Algorithm 2: Adam for a matrix parameter X with factored second moments and first moment decay parameter β1 = 0; Algorithm 6: Proposed hyperparameters for Adafactor. (A sketch of the factored update appears after this table.)
Open Source Code | Yes | Code for running Adafactor is available in the open-source Tensor2Tensor library.
Open Datasets | Yes | We evaluated the optimization algorithms described in this paper on the Transformer machine translation model described in Vaswani et al. (2017) on the same WMT 2014 English-to-German translation task described in that paper, using the latest version of the architecture from the Tensor2Tensor open-source repository.
Dataset Splits | Yes | Results are listed in Table 2. The listed BLEU scores are on the development set, newstest2013, using beam search with beam size 4 and length penalty α = 0.6.
Hardware Specification | Yes | less than two hours each on one Google TPU v2
Software Dependencies | No | using the latest version of the architecture from the Tensor2Tensor open-source repository.
Experiment Setup | Yes | Models were trained for 100,000 steps. Each training batch contained sentence pairs containing approximately 4,096 tokens in the input and 4,096 tokens in the target sentences. In one set of experiments, we followed a similar step size schedule as Vaswani et al. (2017) consisting of a linear warmup followed by inverse-square-root decay, given by αt = 0.1 · min(10^-6 · t, 1/√t). Algorithm 6, proposed hyperparameters for Adafactor: ϵ1 = 10^-30, ϵ2 = 10^-3, d = 1, ρt = min(10^-2, 1/√t), β̂2t = 1 − t^-0.8.
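
To make the Pseudocode row above concrete, here is a minimal NumPy sketch of the factored second-moment estimation described in Algorithm 2 (Adam for a matrix parameter X with first moment decay β1 = 0). Function and variable names are ours, not the paper's or Tensor2Tensor's, and the update clipping of the full Adafactor algorithm is omitted.

    import numpy as np

    def factored_second_moment_step(X, G, R, C, alpha=1e-3, beta2_hat=0.999, eps1=1e-30):
        # One update of a matrix parameter X. Instead of Adam's full n x m
        # second-moment accumulator V, only per-row sums R (shape n) and
        # per-column sums C (shape m) are stored; V is reconstructed as a
        # rank-1 outer product. First-moment decay beta1 = 0, so no momentum.
        sq = G ** 2 + eps1                                       # smoothed squared gradient
        R = beta2_hat * R + (1.0 - beta2_hat) * sq.sum(axis=1)   # row statistics
        C = beta2_hat * C + (1.0 - beta2_hat) * sq.sum(axis=0)   # column statistics
        V_hat = np.outer(R, C) / R.sum()                         # factored estimate of V
        X = X - alpha * G / np.sqrt(V_hat)                       # Adam-style step
        return X, R, C

    # Usage: the accumulators cost n + m numbers instead of n * m.
    n, m = 512, 512
    X = np.random.randn(n, m)
    R, C = np.zeros(n), np.zeros(m)
    for t in range(1, 4):
        G = np.random.randn(n, m)            # stand-in gradient
        beta2_hat = 1.0 - t ** (-0.8)        # increasing decay from Algorithm 6; equals 0 at t = 1
        X, R, C = factored_second_moment_step(X, G, R, C, beta2_hat=beta2_hat)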
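
The Experiment Setup row lists two schedules: the warmup-then-decay step size used for the Adam runs, and the proposed Adafactor hyperparameters of Algorithm 6, where the absolute step size follows the paper's relative-step-size rule αt = max(ϵ2, RMS(Xt-1)) · ρt. A small Python sketch, with helper names and structure that are ours rather than the paper's or Tensor2Tensor's:

    import numpy as np

    def adam_step_size(t):
        # Adam baseline schedule: linear warmup for the first 10,000 steps,
        # then inverse-square-root decay.
        return 0.1 * min(1e-6 * t, 1.0 / np.sqrt(t))

    def adafactor_step_parameters(t, X, eps2=1e-3):
        # Algorithm 6 values: eps1 = 1e-30 enters the second-moment
        # accumulators and d = 1 is the update-clipping threshold; both are
        # applied elsewhere in the full algorithm.
        rho_t = min(1e-2, 1.0 / np.sqrt(t))       # relative step size
        beta2_hat = 1.0 - t ** (-0.8)             # second-moment decay rate
        rms_x = np.sqrt(np.mean(np.square(X)))    # RMS of the parameter matrix
        alpha_t = max(eps2, rms_x) * rho_t        # absolute step size
        return alpha_t, rho_t, beta2_hat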