The Evolved Transformer

Authors: David R. So, Chen Liang, Quoc V. Le

ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The architecture found in our experiments, the Evolved Transformer, demonstrates consistent improvement over the Transformer on four well-established language tasks: WMT 2014 English-German, WMT 2014 English-French, WMT 2014 English-Czech, and LM1B. At a big model size, the Evolved Transformer establishes a new state-of-the-art BLEU score of 29.8 on WMT'14 English-German; at smaller sizes, it achieves the same quality as the original big Transformer with 37.6% fewer parameters and outperforms the Transformer by 0.7 BLEU at a mobile-friendly model size of 7M parameters.
Researcher Affiliation | Industry | David R. So, Chen Liang, Quoc V. Le; Google Research, Brain Team, Mountain View, California, USA. Correspondence to: David R. So <davidso@google.com>.
Pseudocode | Yes | Algorithm 1 (Supplementary Materials) formalizes how the fitness of an individual model is calculated with hurdles, and Algorithm 2 (Supplementary Materials) describes tournament selection augmented with Progressive Dynamic Hurdles. (An illustrative sketch of this selection procedure appears after the table.)
Open Source Code | No | The paper references `https://github.com/tensorflow/tensor2tensor` for Tensor2Tensor implementations and hyperparameter settings, but this is a third-party library the authors used, not the source code for the Evolved Transformer developed in this paper. No explicit statement or link for the paper's own open-source code was found.
Open Datasets | Yes | We use three different machine translation datasets to perform our experiments, all of which were taken from their Tensor2Tensor implementations. The first is WMT English-German... The second translation dataset is WMT En-Fr... The final translation dataset is WMT English-Czech (En-Cs)... For language modeling we used the 1 Billion Word Language Model Benchmark (LM1B) (Chelba et al., 2013)...
Dataset Splits | Yes | using newstest2013 for development and test on newstest2014. ... We train on the 36 million sentence pairs of WMT'14 En-Fr, validate on newstest2013 and test on newstest2014. ... We used the WMT'18 training dataset, again without ParaCrawl, and used newstest2013 and newstest2014 as validation and test sets. (These splits are restated as a small mapping after the table.)
Hardware Specification | Yes | when using a single Google TPU V2 chip, as we do in our search. Each search we describe was run 3 times and the top model from each run was retrained on a single TPU V2 chip for 300K steps. All of the architecture searches we describe were run on WMT'14 En-De. They utilized the search space and tournament selection evolution algorithm described in our Methods section. Unless otherwise noted, each search used 200 workers, which were equipped with a single Google TPU V2 chip for training and evaluation. Table 3 shows the results of these experiments run on the same 8 NVIDIA P100 hardware setup that was used by Vaswani et al. (2017). Upgrading to 16 TPU V2 chips, we doubled the number of synchronous workers for these experiments...
Software Dependencies | No | All of our experiments used Tensor2Tensor’s Transformer TPU hyperparameter settings... but modified to use the memory-efficient Adafactor (Shazeer & Stern, 2018) optimizer. (While software tools like Tensor2Tensor and Adafactor are mentioned, no specific version numbers for these software dependencies are provided.)
Experiment Setup | Yes | to train a Transformer to peak performance on WMT'14 En-De requires 300K training steps, or 10 hours, in the base size when using a single Google TPU V2 chip, as we do in our search. We maintained a population of size 100 with subpopulation sizes for both killing and reproducing set to 30. Mutations were applied independently per encoding field at a rate of 2.5%. We found that our search candidate models, the Transformer, and the Evolved Transformer all benefited from this, and so experimented with using linear decay, single-cycle cosine decay (Loshchilov & Hutter, 2017), and a modified inverse-square-root decay to 0 at 300K steps: lr = step^(-0.00303926) - 0.962392. For ET and all search child models, dropout was applied uniformly after each layer... For En-De and En-Cs, all big and deep sized models were given a higher dropout rate of 0.3... and all other models with an input embedding size of 768 were given a dropout rate of 0.2. For decoding we used the same beam decoding configuration used by Vaswani et al. (2017). That is, a beam size of 4, a length penalty (α) of 0.6, and a maximum output length of input length + 50. (An illustrative sketch of these decay schedules and the decoding configuration appears after the table.)
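
The Pseudocode row above points to Algorithm 1 (fitness with hurdles) and Algorithm 2 (tournament selection with Progressive Dynamic Hurdles) in the paper's Supplementary Materials, which are not reproduced here. The following is a minimal, illustrative Python sketch of how those two procedures could fit together, not the authors' implementation: `train_and_eval` and `mutate` are hypothetical callables, hurdle creation is omitted (the paper derives each new hurdle dynamically from the fitness of the current population), and only the subpopulation size of 30 and the 2.5% mutation rate are taken from the quoted setup.

```python
import random


def evaluate_with_hurdles(model, hurdles, step_increments, train_and_eval):
    """Fitness with hurdles (cf. Algorithm 1, Supplementary Materials).
    A candidate is trained in increments and only earns the next training
    increment if its current fitness clears the corresponding hurdle.
    `train_and_eval(model, steps)` is a hypothetical callable that trains
    `model` for `steps` additional steps and returns its fitness."""
    fitness = train_and_eval(model, step_increments[0])
    for hurdle, steps in zip(hurdles, step_increments[1:]):
        if fitness < hurdle:
            break  # hurdle not cleared: stop training this candidate early
        fitness = train_and_eval(model, steps)
    return fitness


def tournament_step(population, hurdles, step_increments, train_and_eval,
                    mutate, subpop_size=30):
    """One round of tournament selection with Progressive Dynamic Hurdles
    (cf. Algorithm 2). `population` is a list of (model, fitness) pairs;
    `mutate` is a hypothetical callable that copies a parent encoding and
    mutates each encoding field independently with probability 0.025."""
    # Reproduce: the fittest member of a random subpopulation becomes the parent.
    parents = random.sample(population, subpop_size)
    parent_model, _ = max(parents, key=lambda pair: pair[1])
    child = mutate(parent_model)
    child_fitness = evaluate_with_hurdles(child, hurdles, step_increments,
                                          train_and_eval)
    population.append((child, child_fitness))

    # Kill: the least fit member of another random subpopulation is removed.
    victims = random.sample(population, subpop_size)
    population.remove(min(victims, key=lambda pair: pair[1]))
```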
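
The Dataset Splits row is a flattened quote; purely as a compact restatement of what it says (nothing below goes beyond the quote), the splits can be written as a small configuration mapping:

```python
# Dev/test splits exactly as quoted above; training-set notes are informal strings.
DATASET_SPLITS = {
    "WMT'14 En-De": {"dev": "newstest2013", "test": "newstest2014"},
    "WMT'14 En-Fr": {"train": "36M sentence pairs",
                     "dev": "newstest2013", "test": "newstest2014"},
    "WMT En-Cs": {"train": "WMT'18 training data, without ParaCrawl",
                  "dev": "newstest2013", "test": "newstest2014"},
}
```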
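
The Experiment Setup row names three learning-rate decay schemes and the beam-search decoding configuration. The sketch below writes out the quoted modified inverse-square-root decay, lr = step^(-0.00303926) - 0.962392, alongside generic linear and single-cycle cosine decay for comparison; the peak learning rate and total-step constants in the latter two are assumptions for illustration, not values from the paper, and any warmup the authors applied is not shown.

```python
import math


def modified_inverse_sqrt_decay(step: int) -> float:
    """The decay quoted above: lr = step^(-0.00303926) - 0.962392.
    It reaches 0 near step 300K (warmup/scaling, if any, not shown)."""
    step = max(step, 1)  # avoid 0 ** negative exponent at step 0
    return max(0.0, step ** -0.00303926 - 0.962392)


def linear_decay(step: int, total_steps: int = 300_000,
                 peak_lr: float = 1e-3) -> float:
    """Generic linear decay to 0; peak_lr and total_steps are assumed values."""
    return peak_lr * max(0.0, 1.0 - step / total_steps)


def single_cycle_cosine_decay(step: int, total_steps: int = 300_000,
                              peak_lr: float = 1e-3) -> float:
    """Generic single-cycle cosine decay to 0 (cf. Loshchilov & Hutter, 2017);
    peak_lr and total_steps are assumed values."""
    progress = min(step / total_steps, 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Beam-search decoding configuration quoted from the paper
# (same as Vaswani et al., 2017):
BEAM_SIZE = 4
LENGTH_PENALTY_ALPHA = 0.6
MAX_EXTRA_OUTPUT_LENGTH = 50  # maximum output length = input length + 50
```

As a sanity check on the reconstructed formula, step = 1 gives roughly 0.038 and step = 300,000 gives approximately 0, consistent with the 300K-step training budget in the quote.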