Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

Authors: Noam Shazeer, Mitchell Stern

ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate empirically that this method produces similar results to the baseline. We ran the Transformer model from Vaswani et al. (2017), using Adam with and without our factored second moment estimation for optimization. See Section 9 for more details on the experimental setup. Results were similar in all tested cases. See Table 2 (A) vs. (C) and (H) vs. (J). Results are listed in Table 2.
Researcher Affiliation | Collaboration | Google Brain, Mountain View, California, USA; University of California, Berkeley, California, USA.
Pseudocode | Yes | Algorithm 1: Adam (Kingma & Ba, 2015); Algorithm 2: Adam for a matrix parameter X with factored second moments and first moment decay parameter β1 = 0; Algorithm 6: Proposed hyperparameters for Adafactor. (A sketch of the factored update appears after this table.)
Open Source Code | Yes | Code for running Adafactor is available in the open-source Tensor2Tensor library.
Open Datasets | Yes | We evaluated the optimization algorithms described in this paper on the Transformer machine translation model described in Vaswani et al. (2017) on the same WMT 2014 English-to-German translation task described in that paper, using the latest version of the architecture from the Tensor2Tensor open-source repository.
Dataset Splits | Yes | Results are listed in Table 2. The listed BLEU scores are on the development set, newstest2013, using beam search with beam size 4 and length penalty α = 0.6.
Hardware Specification | Yes | less than two hours each on one Google TPU v2
Software Dependencies | No | using the latest version of the architecture from the Tensor2Tensor open-source repository.
Experiment Setup | Yes | Models were trained for 100,000 steps. Each training batch contained sentence pairs containing approximately 4,096 tokens in the input and 4,096 tokens in the target sentences. In one set of experiments, we followed a similar step size schedule as Vaswani et al. (2017) consisting of a linear warmup followed by inverse-square-root decay, given by αt = 0.1 · min(10^-6 · t, 1/√t). Algorithm 6, proposed hyperparameters for Adafactor: ϵ1 = 10^-30, ϵ2 = 10^-3, d = 1, ρt = min(10^-2, 1/√t), β̂2t = 1 − t^-0.8.
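
To make the Pseudocode row above concrete, here is a minimal NumPy sketch of the factored second-moment estimation described in Algorithm 2 (Adam for a matrix parameter X with first moment decay β1 = 0). Function and variable names are ours, not the paper's or Tensor2Tensor's, and the update clipping of the full Adafactor algorithm is omitted.

    import numpy as np

    def factored_second_moment_step(X, G, R, C, alpha=1e-3, beta2_hat=0.999, eps1=1e-30):
        # One update of a matrix parameter X. Instead of Adam's full n x m
        # second-moment accumulator V, only per-row sums R (shape n) and
        # per-column sums C (shape m) are stored; V is reconstructed as a
        # rank-1 outer product. First-moment decay beta1 = 0, so no momentum.
        sq = G ** 2 + eps1                                       # smoothed squared gradient
        R = beta2_hat * R + (1.0 - beta2_hat) * sq.sum(axis=1)   # row statistics
        C = beta2_hat * C + (1.0 - beta2_hat) * sq.sum(axis=0)   # column statistics
        V_hat = np.outer(R, C) / R.sum()                         # factored estimate of V
        X = X - alpha * G / np.sqrt(V_hat)                       # Adam-style step
        return X, R, C

    # Usage: the accumulators cost n + m numbers instead of n * m.
    n, m = 512, 512
    X = np.random.randn(n, m)
    R, C = np.zeros(n), np.zeros(m)
    for t in range(1, 4):
        G = np.random.randn(n, m)            # stand-in gradient
        beta2_hat = 1.0 - t ** (-0.8)        # increasing decay from Algorithm 6; equals 0 at t = 1
        X, R, C = factored_second_moment_step(X, G, R, C, beta2_hat=beta2_hat)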
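
The Experiment Setup row lists two schedules: the warmup-then-decay step size used for the Adam runs, and the proposed Adafactor hyperparameters of Algorithm 6, where the absolute step size follows the paper's relative-step-size rule αt = max(ϵ2, RMS(Xt-1)) · ρt. A small Python sketch, with helper names and structure that are ours rather than the paper's or Tensor2Tensor's:

    import numpy as np

    def adam_step_size(t):
        # Adam baseline schedule: linear warmup for the first 10,000 steps,
        # then inverse-square-root decay.
        return 0.1 * min(1e-6 * t, 1.0 / np.sqrt(t))

    def adafactor_step_parameters(t, X, eps2=1e-3):
        # Algorithm 6 values: eps1 = 1e-30 enters the second-moment
        # accumulators and d = 1 is the update-clipping threshold; both are
        # applied elsewhere in the full algorithm.
        rho_t = min(1e-2, 1.0 / np.sqrt(t))       # relative step size
        beta2_hat = 1.0 - t ** (-0.8)             # second-moment decay rate
        rms_x = np.sqrt(np.mean(np.square(X)))    # RMS of the parameter matrix
        alpha_t = max(eps2, rms_x) * rho_t        # absolute step size
        return alpha_t, rho_t, beta2_hat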