The Evolved Transformer
Authors: David So, Chen Liang, Quoc Le
ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The architecture found in our experiments, the Evolved Transformer, demonstrates consistent improvement over the Transformer on four well-established language tasks: WMT 2014 English-German, WMT 2014 English-French, WMT 2014 English-Czech and LM1B. At a big model size, the Evolved Transformer establishes a new state-of-the-art BLEU score of 29.8 on WMT 14 English-German; at smaller sizes, it achieves the same quality as the original big Transformer with 37.6% less parameters and outperforms the Transformer by 0.7 BLEU at a mobile-friendly model size of 7M parameters. |
| Researcher Affiliation | Industry | David R. So, Chen Liang, Quoc V. Le; Google Research, Brain Team, Mountain View, California, USA. Correspondence to: David R. So <davidso@google.com>. |
| Pseudocode | Yes | Algorithm 1 (Supplementary Materials) formalizes how the fitness of an individual model is calculated with hurdles, and Algorithm 2 (Supplementary Materials) describes tournament selection augmented with Progressive Dynamic Hurdles (a simplified sketch of this selection loop appears below the table). |
| Open Source Code | No | The paper references `https://github.com/tensorflow/tensor2tensor` for Tensor2Tensor implementations and hyperparameter settings, but this is a third-party library used by the authors, not the source code for the Evolved Transformer developed in this paper. No explicit statement or link for the paper's own open-source code was found. |
| Open Datasets | Yes | We use three different machine translation datasets to perform our experiments, all of which were taken from their Tensor2Tensor implementations. The first is WMT English-German... The second translation dataset is WMT En-Fr... The final translation dataset is WMT English-Czech (En-Cs)... For language modeling we used the 1 Billion Word Language Model Benchmark (LM1B) (Chelba et al., 2013)... |
| Dataset Splits | Yes | using newstest2013 for development and test on newstest2014. ... We train on the 36 million sentence pairs of WMT 14 En-Fr, validate on newstest2013 and test on newstest2014. ... We used the WMT 18 training dataset, again without ParaCrawl, and used newstest2013 and newstest2014 as validation and test sets. |
| Hardware Specification | Yes | when using a single Google TPU V.2 chip, as we do in our search. Each search we describe was run 3 times and the top model from each run was retrained on a single TPU V.2 chip for 300K steps. All of the architecture searches we describe were run on WMT 14 En-De. They utilized the search space and tournament selection evolution algorithm described in our Methods section. Unless otherwise noted, each search used 200 workers, which were equipped with a single Google TPU V.2 chip for training and evaluation. Table 3 shows the results of these experiments run on the same 8 NVIDIA P100 hardware setup that was used by Vaswani et al. (2017). Upgrading to 16 TPU V.2 chips, we doubled the number of synchronous workers for these experiments... |
| Software Dependencies | No | All of our experiments used Tensor2Tensor’s Transformer TPU hyperparameter settings... but modified to use the memory-efficient Adafactor (Shazeer & Stern, 2018) optimizer. (While software tools like Tensor2Tensor and Adafactor are mentioned, no specific version numbers for these software dependencies are provided.) |
| Experiment Setup | Yes | to train a Transformer to peak performance on WMT 14 En-De requires 300K training steps, or 10 hours, in the base size when using a single Google TPU V.2 chip, as we do in our search. We maintained a population of size 100 with subpopulation sizes for both killing and reproducing set to 30. Mutations were applied independently per encoding field at a rate of 2.5%. We found that our search candidate models, the Transformer, and the Evolved Transformer all benefited from this and so experimented with using linear decay, single-cycle cosine decay (Loshchilov & Hutter, 2017) and a modified inverse-square-root decay to 0 at 300K steps: lr = step^(-0.00303926) - 0.962392. For ET and all search child models, dropout was applied uniformly after each layer... For En-De and En-Cs, all big and deep sized models were given a higher dropout rate of 0.3... and all other models with an input embedding size of 768 are given a dropout rate of 0.2. For decoding we used the same beam decoding configuration used by Vaswani et al. (2017). That is, a beam size of 4, length penalty (α) of 0.6, and maximum output length of input length + 50. (Sketches of this learning-rate schedule and the beam-scoring length penalty appear below the table.) |
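The evolution setup quoted above (Algorithms 1 and 2, a population of 100, tournament subpopulations of 30 for killing and reproducing, and a 2.5% per-field mutation rate) can be summarized in code. The following is a minimal sketch only, not the authors' implementation: `train_and_evaluate` and `mutate` are hypothetical stand-ins for the real Tensor2Tensor training loop and encoding mutation, and the hurdle values are treated as fixed inputs rather than being computed dynamically from population statistics, as Progressive Dynamic Hurdles actually does.

```python
import random

POPULATION_SIZE = 100   # expected size of the initial population (from the paper)
TOURNAMENT_SIZE = 30    # subpopulation sampled for both killing and reproducing (paper)
MUTATION_RATE = 0.025   # per-encoding-field mutation probability (paper)


def fitness_with_hurdles(encoding, hurdles, step_increments, train_and_evaluate):
    """Train a candidate in increments, stopping early if it misses a hurdle.

    `hurdles` has one value per early checkpoint; `step_increments` has one
    extra entry for the final training segment of candidates that clear them all.
    """
    steps_trained = 0
    fitness = float("-inf")
    for hurdle, extra_steps in zip(hurdles + [None], step_increments):
        steps_trained += extra_steps
        fitness = train_and_evaluate(encoding, steps_trained)  # e.g. negative validation loss
        if hurdle is not None and fitness < hurdle:
            break  # failed this hurdle: stop spending compute on this candidate
    return fitness


def evolve(initial_encodings, hurdles, step_increments,
           train_and_evaluate, mutate, num_children=1000):
    """Tournament-selection evolution with hurdle-based early stopping."""
    population = [(enc, fitness_with_hurdles(enc, hurdles, step_increments,
                                             train_and_evaluate))
                  for enc in initial_encodings]
    for _ in range(num_children):
        # Reproduce: the fittest member of a random subpopulation becomes the parent.
        parent = max(random.sample(population, TOURNAMENT_SIZE), key=lambda m: m[1])
        child = mutate(parent[0], MUTATION_RATE)
        population.append((child, fitness_with_hurdles(child, hurdles, step_increments,
                                                       train_and_evaluate)))
        # Kill: the least fit member of another random subpopulation is removed.
        population.remove(min(random.sample(population, TOURNAMENT_SIZE),
                              key=lambda m: m[1]))
    return max(population, key=lambda m: m[1])
```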
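The learning-rate formula quoted in the Experiment Setup row reconstructs to lr = step^(-0.00303926) - 0.962392, which reaches approximately zero at 300K steps. A minimal sketch of that schedule, ignoring any warmup or optimizer-specific scaling the actual Tensor2Tensor configuration may apply:

```python
def modified_inverse_sqrt_decay(step: int) -> float:
    """lr = step^(-0.00303926) - 0.962392, clipped at zero.

    The exponent and offset are chosen so the schedule decays to ~0 at 300K steps,
    since 300000^(-0.00303926) ≈ 0.96239.
    """
    return max(step ** -0.00303926 - 0.962392, 0.0)


# Example values: ~0.0376 at step 1, ~0 at step 300,000.
print(modified_inverse_sqrt_decay(1))        # 0.037608
print(modified_inverse_sqrt_decay(300_000))  # ~0.0
```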
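For the decoding configuration (beam size 4, length penalty α = 0.6, maximum output length of input length + 50), the illustration below assumes the GNMT-style length penalty ((5 + length) / 6)^α that Tensor2Tensor's beam search applies when ranking hypotheses; it is a sketch of that scoring rule, not the authors' code.

```python
def length_penalty(length: int, alpha: float = 0.6) -> float:
    # GNMT-style penalty: ((5 + length) / 6) ** alpha
    return ((5.0 + length) / 6.0) ** alpha


def normalized_score(log_prob: float, length: int, alpha: float = 0.6) -> float:
    """Beam hypotheses are ranked by log-probability divided by the length penalty."""
    return log_prob / length_penalty(length, alpha)


# With alpha > 0, longer hypotheses are penalized less per token, so a longer
# translation can outrank a shorter one despite a lower raw log-probability.
print(normalized_score(-10.0, length=20))  # -10 / ((25/6)**0.6) ≈ -4.25
print(normalized_score(-9.0, length=10))   # -9 / ((15/6)**0.6) ≈ -5.19
```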