Adafactor: Adaptive Learning Rates with Sublinear Memory Cost
Authors: Noam Shazeer, Mitchell Stern
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate empirically that this method produces similar results to the baseline. We ran the Transformer model from Vaswani et al. (2017), using Adam with and without our factored second moment estimation for optimization. See Section 9 for more details on the experimental setup. Results were similar in all tested cases; see Table 2 (A) vs. (C) and (H) vs. (J). |
| Researcher Affiliation | Collaboration | ¹Google Brain, Mountain View, California, USA; ²University of California, Berkeley, California, USA. |
| Pseudocode | Yes | Algorithm 1: Adam (Kingma & Ba, 2015); Algorithm 2: Adam for a matrix parameter X with factored second moments and first-moment decay parameter β1 = 0; Algorithm 6: proposed hyperparameters for Adafactor (see the code sketch below the table). |
| Open Source Code | Yes | Code for running Adafactor is available in the open-source Tensor2Tensor library. |
| Open Datasets | Yes | We evaluated the optimization algorithms described in this paper on the Transformer machine translation model described in Vaswani et al. (2017) on the same WMT 2014 English-to-German translation task described in that paper, using the latest version of the architecture from the Tensor2Tensor open-source repository. |
| Dataset Splits | Yes | Results are listed in Table 2. The listed BLEU scores are on the development set, newstest2013, using beam search with beam size 4 and length penalty α = 0.6. |
| Hardware Specification | Yes | less than two hours each on one Google TPU v2 |
| Software Dependencies | No | using the latest version of the architecture from the Tensor2Tensor open-source repository. |
| Experiment Setup | Yes | Models were trained for 100,000 steps. Each training batch contained sentence pairs with approximately 4,096 tokens in the input and 4,096 tokens in the target sentences. In one set of experiments, we followed a step size schedule similar to Vaswani et al. (2017), consisting of a linear warmup followed by inverse-square-root decay, given by α_t = 0.1 · min(10^−6 · t, 1/√t). Algorithm 6 proposed hyperparameters for Adafactor: ϵ1 = 10^−30, ϵ2 = 10^−3, d = 1, ρ_t = min(10^−2, 1/√t), β̂_2t = 1 − t^−0.8 (both sketched in code below the table). |
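The warmup-then-decay schedule quoted in the Experiment Setup row can be stated directly in code. This is a minimal sketch of that formula only; the function name is illustrative and is not taken from the paper or from Tensor2Tensor.

```python
import math

def adam_baseline_step_size(t: int) -> float:
    """Step size for the Adam baseline runs: alpha_t = 0.1 * min(1e-6 * t, 1/sqrt(t)),
    a linear warmup that crosses over to inverse-square-root decay at t = 10,000."""
    return 0.1 * min(1e-6 * t, 1.0 / math.sqrt(t))
```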
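The factored second-moment estimation of Algorithm 2, combined with the Algorithm 6 hyperparameters (relative step size ρ_t, decay β̂_2t, update clipping threshold d, and the constants ϵ1, ϵ2), can be sketched for a single matrix parameter as follows. This is a minimal NumPy illustration, not the Tensor2Tensor implementation; the function name and the explicit accumulator passing are assumptions made for clarity.

```python
import numpy as np

def adafactor_matrix_step(X, G, R, C, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """One Adafactor-style update for an n x m matrix parameter X given gradient G.

    R (shape (n,)) and C (shape (m,)) are the running row- and column-sum
    second-moment accumulators from the previous step (initialize to zeros).
    Illustrative sketch only; not the reference implementation.
    """
    # Hyperparameters as proposed in Algorithm 6.
    rho_t = min(1e-2, 1.0 / np.sqrt(t))        # relative step size
    beta2_t = 1.0 - t ** (-0.8)                # increasing second-moment decay
    alpha_t = max(eps2, np.sqrt(np.mean(X ** 2))) * rho_t  # scale by RMS of X

    # Algorithm 2: keep only exponential moving averages of the row and
    # column sums of the squared gradient (sublinear memory).
    G2 = G ** 2 + eps1
    R = beta2_t * R + (1.0 - beta2_t) * G2.sum(axis=1)
    C = beta2_t * C + (1.0 - beta2_t) * G2.sum(axis=0)
    V_hat = np.outer(R, C) / R.sum()           # rank-1 reconstruction of the second moment

    # Update clipping at RMS threshold d, then apply the step (beta1 = 0, no momentum).
    U = G / np.sqrt(V_hat)
    U = U / max(1.0, np.sqrt(np.mean(U ** 2)) / d)
    return X - alpha_t * U, R, C
```

Note that at t = 1 the decay β̂_2t is 0, so the accumulators are fully overwritten by the first squared gradient, which is why this scheme needs no bias correction.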