Attention Is All You Need
Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
NeurIPS 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. |
| Researcher Affiliation | Collaboration | Ashish Vaswani Google Brain avaswani@google.com Noam Shazeer Google Brain noam@google.com Niki Parmar Google Research nikip@google.com Jakob Uszkoreit Google Research usz@google.com Llion Jones Google Research llion@google.com Aidan N. Gomez University of Toronto aidan@cs.toronto.edu Łukasz Kaiser Google Brain lukaszkaiser@google.com Illia Polosukhin illia.polosukhin@gmail.com |
| Pseudocode | No | No clearly labeled pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | The code we used to train and evaluate our models is available at https://github.com/tensorflow/tensor2tensor. |
| Open Datasets | Yes | We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. [...] For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary [31]. |
| Dataset Splits | Yes | We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. [...] On the WMT 2014 English-to-German translation task... [...] measuring the change in performance on English-to-German translation on the development set, newstest2013. |
| Hardware Specification | Yes | We trained our models on one machine with 8 NVIDIA P100 GPUs. |
| Software Dependencies | No | The paper mentions using the Adam optimizer and providing code on TensorFlow, but does not specify version numbers for any software components or libraries. |
| Experiment Setup | Yes | We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We trained the base models for a total of 100,000 steps or 12 hours. [...] We used the Adam optimizer [17] with β₁ = 0.9, β₂ = 0.98 and ε = 10⁻⁹. We varied the learning rate over the course of training, according to the formula: lrate = d_model^(−0.5) · min(step_num^(−0.5), step_num · warmup_steps^(−1.5)) [...] We used warmup_steps = 4000. [...] We apply dropout [27] to the output of each sub-layer... For the base model, we use a rate of P_drop = 0.1. [...] we employed label smoothing of value ε_ls = 0.1 [30]. (A sketch of this learning-rate schedule follows the table.) |
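
Below is a minimal sketch of the learning-rate schedule quoted in the Experiment Setup row, assuming d_model = 512 and warmup_steps = 4000 as in the base configuration. The function name and plain-Python form are illustrative, not taken from the tensor2tensor code.

```python
def transformer_lrate(step_num, d_model=512, warmup_steps=4000):
    """Learning rate from the paper:
    lrate = d_model^(-0.5) * min(step_num^(-0.5), step_num * warmup_steps^(-1.5)).

    The rate rises linearly for the first `warmup_steps` steps, then decays
    proportionally to the inverse square root of the step number.
    """
    step_num = max(step_num, 1)  # guard against division by zero at step 0
    return d_model ** -0.5 * min(step_num ** -0.5,
                                 step_num * warmup_steps ** -1.5)

# The peak rate occurs exactly at step_num == warmup_steps.
print(transformer_lrate(4_000))    # ~7.0e-4
print(transformer_lrate(100_000))  # ~1.4e-4 (end of base-model training)
```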